Abstract

Future civilian rescue and military operations will depend on a complex system of communi- 
cating devices that can operate in highly dynamic environments. In order to present a consistent 
view of a complex world, these devices will need to maintain data objects with atomic (linearizable)
read/write semantics.

Lynch and Shvartsman have recently developed a reconfigurable atomic read/write memory 
algorithm for such environments [12, 13]. This algorithm, called Rambo, guarantees atomicity
for arbitrary patterns of asynchrony, message loss, and node crashes. Rambo installs new
configurations lazily, transferring data from old configurations to new configurations using a 
background information transfer task. That task handles configurations sequentially, transfer- 
ring information from each configuration to the next. 

This paper presents a new algorithm, Rambo II, that implements a radically different ap- 
proach to installing new configurations: instead of operating sequentially, the new algorithm 
reconfigures "aggressively", transferring information from old configurations in parallel. This 
improvement substantially reduces the time necessary to remove obsolete configurations, which 
in turn substantially increases the fault-tolerance. This paper presents a formal specification of 
the new algorithm, a correctness proof, and a conditional analysis of its performance. Prelimi- 
nary empirical studies performed using LAN implementations of Rambo and the new algorithm 
illustrate the advantages of the new algorithm. 
1 Introduction

Future large scale civilian rescue and military deployment operations will involve large numbers of 
communication and computing devices operating in highly dynamic network substrates. Successful 
coordination and marshaling of human resources and equipment involves collecting information 
about a complex real-world situation using sensors and input devices, gathering the information in 
survivable repositories, and providing appropriate and coherent information to the stakeholders. 

Data objects with atomic (linearizable) read/write semantics commonly occur in such settings. 
Replication of objects is a prerequisite for fault-tolerance and availability, and with replication 
comes the need to maintain consistency. Additionally, in dynamic settings where participants may 
join and leave the environment, may fail, and where the physical objects migrate, one needs to be 
able to effectively move the corresponding data objects from one set of data owners to another. 

Lynch and Shvartsman developed a reconfigurable atomic read/write memory algorithm for dy- 
namic networks [12, 13]. The algorithm, called RAMBO, guarantees atomicity for arbitrary patterns 
of asynchrony, message loss, and node crashes. Conditional performance analysis of the algorithm 
shows that when the environment timing stabilizes, when failures are within specific parameters, 
and when the reconfigurations are not frequent and not bursty, then read and write operations 
have small latency bounded in terms of the maximum message delay and the periodic gossip interval.
However, when the reconfigurations are frequent or bursty, this algorithm may perform poorly
because of the inherently sequential processing of the new configurations once they become determined
by the algorithm. In particular, the number of configurations maintained by the algorithm
may grow without bound, leading to an unbounded number of messages necessary in processing
the read and write operations. Such situations may arise due to failures or asynchrony, yet these
are not the only reasons. Even in synchronous failure-free environments the world dynamics may 
require that frequent reconfigurations are performed to keep track of the rapidly moving physical 
objects or rapidly changing set of stakeholders. 

This paper presents a new algorithm, Rambo II, integrated with Rambo, that implements 
a radically different approach to installing new configurations: instead of operating sequentially, 
the new algorithm reconfigures "aggressively", transferring information from old configurations in 
parallel. This improvement substantially reduces the time necessary to process new configurations 
and to remove obsolete configurations from the system, which in turn substantially increases fault- 
tolerance. This is due to the fact that once a configuration is removed, the system no longer depends 
on it, and as soon as the configuration is removed, it is allowed to fail. The process executing the 
new algorithm achieves a linear speed-up in the number of old configurations known to the process. 
For example, our conditional performance analysis shows that if a process knows about a sequence
of h configurations, then it can eliminate all but one of these configurations in time O(1), as
compared to the original Rambo, where this takes Θ(h) time. Additionally, the new algorithm
reduces the number of messages necessary to process these configurations.

This paper presents a formal specification of the new algorithm, a correctness proof, and a 
conditional analysis of its performance. Preliminary empirical studies performed using LAN imple- 
mentations of Rambo and the new algorithm illustrate the advantages of the new algorithm. 

Background. Starting with the work of Gifford [6] and Thomas [18], intersecting collections of 
sets found use in several algorithms providing consistent data in distributed settings. Depend- 
ing on the algorithm and its setting, such collections of sets, called quorums when any two have 
non-empty intersection, represent either sets of processors or their knowledge. Upfal and Wigder- 
son [19] use majority sets of readers and writers to emulate shared memory in a distributed setting. 



Vitanyi and Awerbuch [20] implement multi-writer/multi-reader registers using matrices of single- 
writer/single-reader registers where the rows and the columns are written and respectively read 
by specific processors. Attiya, Bar-Noy and Dolev [1] use majorities of processors to implement 
single-writer/multi-reader objects in message passing systems. Such algorithms assume a static
processor universe and rely on static quorum systems.

In long-lived systems where processors may dynamically join and leave the system, it is impor- 
tant to reconfigure a quorum system to adapt it to the new set of processors [8, 4, 7, 17]. Prior 
approaches required that the new quorum system include processors from the old quorum system. 
This is stated as a static constraint on the quorum system that needs to be satisfied during or even 
before the reconfiguration. In our work on reconfigurable atomic memory [15, 5, 12] we replace 
the space-domain requirement on successive quorum system intersections with the time-domain re- 
quirement that some quorums from the old and the new system are involved in the reconfiguration 
algorithm. Such systems are more dynamic because they allow for more choices of new quorum 
systems and do not require that successive configurations intersect. 

Reconfiguration in Highly Dynamic Settings. Lynch and Shvartsman's earlier algorithms [15, 
5] allowed a single distinguished process to act as the quorum system reconfigurer. The advantage 
of the single-reconfigurer approach is its relative simplicity and efficiency: any process maintains at 
most two configurations, the current configuration and the proposed new configuration. The dis- 
advantage of the single reconfigurer is that it is a single point of failure - no further reconfiguration 
is possible if the reconfigurer fails. 

The RAMBO algorithm [12, 13] removed the requirement of having a single reconfigurer, thus 
enabling any process within its own current configuration to begin reconfiguration to a new quorum 
system supplied by the environment. The algorithm implements atomic shared memory suitable for 
use in highly dynamic settings, and it guarantees atomicity in any asynchronous execution and in 
the presence of arbitrary process and network failures. However, the multiple-reconfigurer approach
introduces the problem of maintaining multiple configurations and removing old configurations 
from the system. Rambo implements a sequential "garbage-collection" algorithm where processes 
remove obsolete configurations one-at-a-time. Configuration removal requires that information is 
propagated from the earliest known configuration to its successor. Since arbitrarily many new
configurations may be introduced, this leads to an unbounded number of old configurations that
need to be sequentially removed. 

The environment may introduce new configurations for several reasons: (i) due to failures 
and network instability that endanger installed configurations, (ii) due to the mobility of the 
physical objects represented by the abstract memory objects and the mobility of the processes 
maintaining the object replicas, and (iii) due to the need to rebalance loads on processes within
installed configurations. Frequent or bursty reconfiguration can substantially increase the number 
of installed configurations and, since a process performing a read or a write operation potentially 
needs to contact quorums in all configurations known to it, this leads to the corresponding increase 
in the number of messages needed to perform the operation. 

The New Algorithm. The primary contribution of this paper is a new algorithm for reconfig- 
urable atomic memory, based on the original RAMBO, that implements an aggressive configuration- 
replacement protocol where any locally-known contiguous sequence of configurations is replaced by 
the last configuration in the sequence. The removal of the old configurations is done in parallel, 
while preserving all other properties of the original Rambo. Specifically, we maintain a loose coupling
between the reconfiguration algorithms and the original Rambo algorithms implementing the
read and write operations.

In order to achieve availability in the presence of failures, the objects are replicated at several 
network locations. In order to maintain memory consistency in the presence of small and transient 
changes, the algorithm uses configurations, each of which consists of a set of members plus sets of 
read-quorums and write-quorums. In order to accommodate larger and more permanent changes,
the algorithm supports reconfiguration, by which the set of members and the sets of quorums are 
modified. Such changes do not cause violations of atomicity. Any quorum configuration may be 
installed at any time — no intersection requirement is imposed on the sets of members or on the 
quorums of distinct configurations. 

The algorithm is composed of a main algorithm, which handles reading, writing, and replace- 
ment of old configurations with a successor configuration, and a global configuration announcement 
service, Recon, which provides the main algorithm with a consistent sequence of configurations. 
Several configurations may be known to the algorithm at one time, and read and write operations 
can use them all without any harm. 

The main algorithm performs read and write operations requested by clients using a two-phase 
strategy, where the first phase gathers information from read-quorums of active configurations 
and the second phase propagates information to write-quorums of active configurations. This 
communication is carried out using background gossiping, which allows the algorithm to maintain 
only a small amount of protocol state information. Each phase is terminated by a fixed point 
condition that involves a quorum from each active configuration. Different read and write operations 
may execute concurrently: the restricted semantics of reads and writes permit the effects of this 
concurrency to be sorted out afterward. 

The main algorithm provides a new configuration-replacement algorithm that removes old 
configurations while ensuring that their use is no longer necessary for maintaining consistency. 
Configuration-replacement also uses a two-phase strategy, where the first phase communicates in 
parallel with all old configurations being removed and the second phase communicates with a new 
configuration. A configuration-replacement operation ensures that both a read-quorum and a write- 
quorum of each old configuration learn about the new configuration, and that the latest value from 
all old configurations is conveyed to a write-quorum of the new configuration. The strength of 
the new algorithm is that it proceeds aggressively in parallel. An arbitrary number of old config- 
urations can be replaced in constant time (assuming bounded message latency and non-failure of 
active configurations). 

The configuration announcement service is implemented by a distributed algorithm that uses 
distributed consensus to agree on the successive configurations. Any member of the latest config- 
uration c may propose a new configuration at any time; different proposals are reconciled by an 
execution of consensus among the members of c. Consensus is, in turn, implemented using a version 
of the Paxos algorithm [9], as described formally in [3]. Although such consensus executions may 
be slow — in fact, in some situations, they may not even terminate — they do not cause any delays 
for read and write operations. 

All services and algorithms, and their interactions, are specified using I/O automata. We 
show correctness (atomicity) of the algorithm for arbitrary patterns of asynchrony and failures. 
On the other hand, we analyze performance conditionally, based on certain failure and timing
assumptions. For example, assuming that gossip and configuration-replacement occur periodically,
and that quorums of active configurations do not fail, we show that read and write operations
complete within time 8d, where d is the maximum message latency. Note that the original Rambo
algorithm also had to assume that garbage-collection is able to keep up; this assumption is
not necessary in the new algorithm due to the new configuration replacement algorithm. For the
configuration replacement algorithm we show that any number of configurations can be replaced
by their successor in constant time.

At the same time, all the performance results of the original Rambo algorithm still hold; in 
instances where the network is reliable and timely throughout the execution, the bounds described 
in the previous Rambo papers [12, 13] still hold. 

Implementations of Rambo and Rambo II on a LAN are currently being completed [16]. 
Preliminary empirical studies performed using this implementation illustrate the advantages of the 
new algorithm over the previous one. 

Document Structure. In Section 2 we describe the original Rambo algorithm of Lynch and 
Shvartsman, and then in Section 3 present and discuss the formal specification of Rambo II. In 
Section 4 we present some notation, and restate some basic lemmas, only slightly modified from 
Rambo. In Section 5 we prove that the new algorithm guarantees atomic consistency. In Section 6 
we present the reconfiguration service. In Section 7 we analyze the performance of Rambo II, and 
discuss in detail the areas in which this algorithm improves over the original RAMBO algorithm. In 
Section 8 we discuss the preliminary performance results. Finally, in Section 9 we summarize the
results and outline areas for future research.

2 The Original Rambo Algorithm 

In this section, we present the original RAMBO algorithm, on which the new algorithm RAMBO II 
is based. Rambo is an algorithm designed to support read/write operations on an atomic shared 
memory. 

In order to achieve fault tolerance and availability, Rambo replicates data at several network 
locations. The algorithm uses configurations to maintain consistency in the presence of small and 
transient changes. Each configuration consists of a set of members plus sets of read-quorums and
write-quorums. The quorum intersection property requires that every read-quorum intersect every
write-quorum. Read and write operations are implemented as a two-phase protocol, in which each 
phase accesses a set of read or write quorums. 
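
The quorum intersection property can be checked mechanically for a given configuration. The following is a minimal sketch, not part of the Rambo specification: the function name quorums_intersect, the dictionary layout, and the three-member majority configuration are all illustrative assumptions.

```python
from itertools import product


def quorums_intersect(read_quorums, write_quorums):
    """Return True if every read-quorum shares a member with every write-quorum."""
    return all(r & w for r, w in product(read_quorums, write_quorums))


# Hypothetical configuration: members {1, 2, 3}, quorums are all majorities.
majorities = [frozenset({1, 2}), frozenset({1, 3}), frozenset({2, 3})]
config = {
    "members": frozenset({1, 2, 3}),
    "read_quorums": majorities,
    "write_quorums": majorities,
}

ok = quorums_intersect(config["read_quorums"], config["write_quorums"])

# A read-quorum {1} and a write-quorum {2, 3} are disjoint, so the property fails.
bad = quorums_intersect([frozenset({1})], [frozenset({2, 3})])
```

The check applies within one configuration only; no such requirement is placed on quorums of distinct configurations.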

Rambo supports reconfiguration, which modifies the set of members and the sets of quorums, 
thereby accommodating larger and more permanent changes without violating atomicity. In this 
way, failed nodes can be removed from active quorums, and newly joined nodes can be integrated 
into the system. Any quorum configuration may be installed at any time - no intersection require- 
ment is imposed on the sets of members or on the quorums of distinct configurations. 

The Rambo algorithm consists of three kinds of automata: 

• Joiner automata, which handle join requests, 

• Recon automata, which handle reconfiguration requests, and generate a totally ordered se- 
quence of configurations, and 

• Reader-Writer automata, which handle read and write requests, manage garbage collection, 
and send and receive gossip messages. 

In this paper, we discuss only the Reader-Writer automaton. The Joiner automaton is quite
simple; it sends a join message when node i joins, and sends a join-ack message in response to join
messages. The Recon automaton depends on a consensus service, implemented using Paxos [9], to
agree on a total ordering of configurations. However, we assume that this total ordering exists, and
therefore need not discuss this automaton any further. For more details of these two automata, see
the original Rambo papers [12, 13].

The complete implementation S is the composition of all the automata described above (the
Joiner_i, Reader-Writer_i, and Recon_i automata for all i, and all the channels), with all the actions
that are not external actions of the Rambo specification hidden.

Input:
  join(rambo, J)_{x,i}, J a finite subset of I − {i}, x ∈ X, i ∈ I,
      such that if i = (i_0)_x then J = ∅
  read_{x,i}, x ∈ X, i ∈ I
  write(v)_{x,i}, v ∈ V_x, x ∈ X, i ∈ I
  recon(c, c')_{x,i}, c, c' ∈ C, i ∈ members(c), x ∈ X, i ∈ I
  fail_i, i ∈ I

Output:
  join-ack(rambo)_{x,i}, x ∈ X, i ∈ I
  read-ack(v)_{x,i}, v ∈ V_x, x ∈ X, i ∈ I
  write-ack_{x,i}, x ∈ X, i ∈ I
  recon-ack(b)_{x,i}, b ∈ {ok, nok}, x ∈ X, i ∈ I
  report(c)_{x,i}, c ∈ C, x ∈ X, i ∈ I

Figure 1: RAMBO(x): External signature

The external signature for Rambo appears in Figure 1. The algorithm is specified for a single
memory location, and extended to implement a complete shared memory. A client uses the join_i
action to join the system. After receiving a join-ack_i, the client can issue read_i and write_i requests,
which result in read-ack_i and write-ack_i responses. The client can issue a recon_i request to propose
a new configuration. Finally, the fail_i action is used to model node i failing.

The signature and state for the Reader-Writer automata are presented in Figure 2. The code
for the Reader-Writer automata is presented in Figure 3. All three operations, read, write, and
garbage-collect, are implemented using gossip messages. Unlike in many other algorithms, there are
no directed messages specified in this algorithm; at no point does a given node, say i, decide to send
a message specifically to node j. Instead, at regular intervals node i will non-deterministically send
all of its public state to other nodes. Progress in an operation occurs when enough information
has been exchanged. After initiating an operation, the automaton waits until it can be sure that it
has shared state with enough other nodes (using gossip messages), and then declares the operation
complete. The phase numbering regime, implemented using pnum1 and pnum2, is used to determine
when enough communication has completed.

Every node maintains a tag and a value for the data object. Every new value is assigned a 
unique tag, with ties broken by process-ids. These tags are used to determine an ordering of the 
write operations, and therefore determine the value that a read operation should return. 
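
The tag ordering can be illustrated with lexicographically compared pairs. This is a sketch under stated assumptions: the Tag class, its field names seq and pid, and the helper next_tag are hypothetical names, and process ids are taken to be integers.

```python
from typing import NamedTuple


class Tag(NamedTuple):
    seq: int  # sequence number, compared first
    pid: int  # process id, breaks ties between equal sequence numbers


def next_tag(seen, pid):
    """Tag for a new write: strictly larger than every tag in `seen`."""
    return Tag(max(t.seq for t in seen) + 1, pid)


t_a = Tag(3, 1)
t_b = Tag(3, 2)  # same sequence number, larger process id

# The largest tag identifies the most recent write.
newest = max([Tag(2, 9), t_a, t_b])

# A writer with process id 1 that has seen t_a and t_b picks a larger tag.
fresh = next_tag([t_a, t_b], pid=1)
```

Because named tuples compare field by field, two writes that choose the same sequence number are still totally ordered by process id.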

Read and write operations require two phases, a query phase and a propagation phase, each 
of which accesses certain quorums of replicas. Assume the operation is initiated at node i. See 
Figure 5 for a summary of the two phases. First, in the query phase, node i contacts read quorums 
to determine the most recent available tag and value. Then, in the propagation phase, node i 
contacts write quorums. If the operation is a read operation, the second phase propagates the 
largest tag discovered in the query phase, and its associated value. If the operation is a write 
operation, node i chooses a new tag, strictly larger than every tag discovered in the query phase,
and propagates the new tag and the new value to the write quorums. Note that every operation
accesses both read and write quorums. 
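
The two phases can be sketched on an in-memory replica table. This is a simplified illustration, not the Rambo automaton: it assumes a single configuration, contacts one chosen quorum directly instead of gossiping, and ignores failures and concurrency. The names query, propagate, read, and write are illustrative.

```python
def query(replicas, quorum):
    """Phase 1: return the (tag, value) pair with the largest tag in the quorum."""
    return max((replicas[p] for p in quorum), key=lambda tv: tv[0])


def propagate(replicas, quorum, tag, value):
    """Phase 2: install (tag, value) at quorum members that hold an older tag."""
    for p in quorum:
        if replicas[p][0] < tag:
            replicas[p] = (tag, value)


def read(replicas, read_q, write_q):
    tag, value = query(replicas, read_q)
    propagate(replicas, write_q, tag, value)  # a read also writes back
    return value


def write(replicas, read_q, write_q, pid, value):
    (seq, _), _ = query(replicas, read_q)
    tag = (seq + 1, pid)  # strictly larger than every tag discovered
    propagate(replicas, write_q, tag, value)
    return tag


# Three replicas, each holding ((seq, pid), value).
replicas = {p: ((0, 0), "v0") for p in (1, 2, 3)}
new_tag = write(replicas, read_q={1, 2}, write_q={2, 3}, pid=1, value="x")
result = read(replicas, read_q={1, 3}, write_q={1, 2})
```

In this run the read's quorum {1, 3} overlaps the write's quorum {2, 3} at replica 3, so the read observes the new value and then propagates it to replica 1.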

During a phase of an operation, whenever node i receives a gossip message from node j, it 
compares the largest phase number j has received from i (by examining pns) to the local phase 
number when the operation began. If j initiated the gossip message after receiving a message from 
i sent after the phase began, then i adds j to the acc set. In effect, there has been a round-trip
message sent from i to j back to i. Also, i then updates its op.cmap if necessary. 
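
The round-trip test can be sketched with two nodes exchanging phase numbers. This is an illustrative model only: the Node class and its method names are hypothetical, messages carry just (sender, pns, pnr), and the acceptance test is assumed to be that pnr is at least the phase number recorded when the phase began.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.pnum1 = 0       # own phase counter
        self.pnum2 = {}      # largest phase number heard from each peer
        self.op_pnum = None  # phase number of the ongoing phase, if any
        self.acc = set()     # peers known to have answered this phase

    def start_phase(self):
        self.pnum1 += 1
        self.op_pnum = self.pnum1
        self.acc = set()

    def gossip_to(self, peer):
        """Build a message (sender, pns, pnr) addressed to `peer`."""
        return (self.name, self.pnum1, self.pnum2.get(peer, 0))

    def receive(self, msg):
        sender, pns, pnr = msg
        self.pnum2[sender] = max(self.pnum2.get(sender, 0), pns)
        # Accept the sender only if it had already heard about this phase.
        if self.op_pnum is not None and pnr >= self.op_pnum:
            self.acc.add(sender)


i, j = Node("i"), Node("j")
stale = j.gossip_to("i")  # composed before i's phase begins
i.start_phase()
i.receive(stale)          # carries an old pnr, so j is not counted
acc_after_stale = set(i.acc)

j.receive(i.gossip_to("j"))  # j learns i's new phase number
i.receive(j.gossip_to("i"))  # j echoes it back: round trip complete
acc_after_round_trip = set(i.acc)
```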

Garbage collection operations remove old configurations from the system. A garbage collection 



Signature:

Input:
  read_i
  write(v)_i, v ∈ V
  new-config(c, k)_i, c ∈ C, k ∈ N+
  recv(join)_{j,i}, j ∈ I − {i}
  recv(m)_{j,i}, m ∈ M, j ∈ I
  join(rw)_i
  fail_i

Output:
  join-ack(rw)_i
  read-ack(v)_i, v ∈ V
  write-ack_i
  send(m)_{i,j}, m ∈ M, j ∈ I

Internal:
  query-fix_i
  prop-fix_i
  gc(k)_i, k ∈ N
  gc-query-fix(k)_i, k ∈ N
  gc-prop-fix(k)_i, k ∈ N
  gc-ack(k)_i, k ∈ N

State:
  status ∈ {idle, joining, active, failed}, initially idle
  world, a finite subset of I, initially ∅
  value ∈ V, initially v_0
  tag ∈ T, initially (0, i_0)
  cmap ∈ CMap, initially cmap(0) = c_0, cmap(k) = ⊥ for k ≥ 1
  pnum1 ∈ N, initially 0
  pnum2, a mapping from I to N, initially everywhere 0
  failed, a Boolean, initially false

  op, a record with fields:
    type ∈ {read, write}
    phase ∈ {idle, query, prop, done}, initially idle
    pnum ∈ N
    cmap ∈ CMap
    acc, a finite subset of I
    value ∈ V

  gc, a record with fields:
    phase ∈ {idle, query, prop}, initially idle
    pnum ∈ N
    acc, a finite subset of I
    cmap ∈ CMap
    index ∈ N

Figure 2: Reader-Writer_i: Signature and state



operation involves two configurations: the old configuration being removed and the new config- 
uration being established. See Figure 6 for a summary of the two phases. A garbage collection 
operation requires two phases, a query phase and a propagation phase. The first phase contacts 
a read-quorum and a write-quorum from the old configuration, and the second phase contacts a 
write-quorum from the new configuration. 

Note that, unlike a read or write operation, the first phase of the garbage-collection operation 
must contact two types of quorums: a read-quorum and a write-quorum for the configuration being 
garbage-collected. This ensures that enough nodes are aware of the new configurations, and ensures 
that any ongoing read/write operations will include the new, larger, configuration. 

The cmap is a mapping from integer indices to configurations ∪ {⊥, ±}, that initially maps every
index to ⊥. The cmap tracks which configurations are active, which are not defined, indicated by
⊥, and which are removed, indicated by ±. The total ordering on configurations determined by
the Recon automata ensures that all nodes agree on which configuration is stored in each position 
in the array. We define c(k) to be the configuration associated with index k. 
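
A cmap can be modeled as a dictionary from indices to entries. The sketch below is illustrative: the update operator is not defined in this section, so it is assumed here to be a pointwise merge under the order undefined < configuration < removed. The names BOTTOM, REMOVED, rank, update, and active are hypothetical.

```python
BOTTOM = "undefined"  # stands for ⊥
REMOVED = "removed"   # stands for ±


def rank(entry):
    """Order entries: undefined < a configuration < removed."""
    if entry == BOTTOM:
        return 0
    if entry == REMOVED:
        return 2
    return 1


def update(cmap, other):
    """Pointwise merge (assumed semantics): keep the more advanced entry per index."""
    merged = {}
    for k in set(cmap) | set(other):
        a = cmap.get(k, BOTTOM)
        b = other.get(k, BOTTOM)
        merged[k] = a if rank(a) >= rank(b) else b
    return merged


def active(cmap):
    """Indices whose entry is a configuration (neither undefined nor removed)."""
    return sorted(k for k, e in cmap.items() if rank(e) == 1)


local = {0: REMOVED, 1: "c1", 2: "c2"}
gossip = {1: REMOVED, 2: "c2", 3: "c3"}
merged = update(local, gossip)
```

Since all nodes agree on which configuration occupies each index, the merge never has to choose between two different configurations at the same index.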

The record op stores information about the current phase of an ongoing read or write operation, 
while gc stores information about an ongoing garbage collection operation. (A node can process 



Output send(⟨W, v, t, cm, pns, pnr⟩)_{i,j}
Precondition:
  ¬failed
  status = active
  j ∈ world
  ⟨W, v, t, cm, pns, pnr⟩ = ⟨world, value, tag, cmap, pnum1, pnum2(j)⟩
Effect:
  none

Input recv(⟨W, v, t, cm, pns, pnr⟩)_{j,i}
Effect:
  if ¬failed then
    if status ≠ idle then
      status ← active
      world ← world ∪ W
      if t > tag then (value, tag) ← (v, t)
      cmap ← update(cmap, cm)
      pnum2(j) ← max(pnum2(j), pns)
      if op.phase ∈ {query, prop} and pnr ≥ op.pnum then
        op.cmap ← extend(op.cmap, truncate(cm))
        if op.cmap ∈ Truncated then
          op.acc ← op.acc ∪ {j}
        else
          op.acc ← ∅
          op.cmap ← truncate(cmap)
      if gc.phase ∈ {query, prop} and pnr ≥ gc.pnum then
        gc.acc ← gc.acc ∪ {j}

Input new-config(c, k)_i
Effect:
  if ¬failed then
    if status ≠ idle then
      cmap(k) ← update(cmap(k), c)

Input read_i
Effect:
  if ¬failed then
    if status ≠ idle then
      pnum1 ← pnum1 + 1
      ⟨op.pnum, op.type, op.phase, op.cmap, op.acc⟩
        ← ⟨pnum1, read, query, truncate(cmap), ∅⟩

Input write(v)_i
Effect:
  if ¬failed then
    if status ≠ idle then
      pnum1 ← pnum1 + 1
      ⟨op.pnum, op.type, op.phase, op.cmap, op.acc, op.value⟩
        ← ⟨pnum1, write, query, truncate(cmap), ∅, v⟩

Internal query-fix_i
Precondition:
  ¬failed
  status = active
  op.type ∈ {read, write}
  op.phase = query
  ∀k ∈ N, c ∈ C : op.cmap(k) = c
    ⇒ ∃R ∈ read-quorums(c) : R ⊆ op.acc
Effect:
  if op.type = read then op.value ← value
  else value ← op.value
       tag ← (tag.seq + 1, i)
  pnum1 ← pnum1 + 1
  op.pnum ← pnum1
  op.phase ← prop
  op.cmap ← truncate(cmap)
  op.acc ← ∅

Internal prop-fix_i
Precondition:
  ¬failed
  status = active
  op.type ∈ {read, write}
  op.phase = prop
  ∀k ∈ N, c ∈ C : op.cmap(k) = c
    ⇒ ∃W ∈ write-quorums(c) : W ⊆ op.acc
Effect:
  op.phase ← done

Output read-ack(v)_i
Precondition:
  ¬failed
  status = active
  op.type = read
  op.phase = done
  v = op.value
Effect:
  op.phase ← idle

Output write-ack_i
Precondition:
  ¬failed
  status = active
  op.type = write
  op.phase = done
Effect:
  op.phase ← idle

Figure 3: Reader-Writer_i: Read/write transitions



Internal gc(k)_i
Precondition:
  ¬failed
  status = active
  gc.phase = idle
  cmap(k) ∈ C
  cmap(k + 1) ∈ C
  k = 0 or cmap(k − 1) = ±
Effect:
  pnum1 ← pnum1 + 1
  gc.pnum ← pnum1
  gc.phase ← query
  gc.acc ← ∅
  gc.index ← k

Internal gc-query-fix(k)_i
Precondition:
  ¬failed
  status = active
  gc.phase = query
  gc.index = k
  cmap(k) ≠ ±
  ∃R ∈ read-quorums(cmap(k)) :
    ∃W ∈ write-quorums(cmap(k)) :
      R ∪ W ⊆ gc.acc
Effect:
  pnum1 ← pnum1 + 1
  gc.pnum ← pnum1
  gc.phase ← prop
  gc.acc ← ∅

Internal gc-prop-fix(k)_i
Precondition:
  ¬failed
  status = active
  gc.phase = prop
  gc.index = k
  ∃W ∈ write-quorums(cmap(k + 1)) : W ⊆ gc.acc
Effect:
  cmap(k) ← ±

Internal gc-ack(k)_i
Precondition:
  ¬failed
  status = active
  gc.index = k
  cmap(k) = ±
Effect:
  gc.phase ← idle

Figure 4: Reader-Writer_i: Rambo garbage-collection transitions

read and write operations even when a garbage collection operation is ongoing.) The op.cmap 
subfield records the configuration map for an operation. This consists of the node's cmap when 
a phase begins, augmented by any new configurations discovered during the phase. A phase can 
complete only when the initiator has exchanged information with quorums from every non-removed 
configuration in op.cmap. The pnum subfield records the phase number when the phase begins, 
allowing the initiator to determine which responses correspond to the current phase. The acc
subfield records which nodes from which quorums have responded during the current phase.

In Rambo, configurations go through three phases: proposal, installation, and upgrade. First, 
a configuration is proposed by a recon event. Next, if the proposal is successful, the Recon service 
achieves consensus on the new configuration, and notifies participants with decide events. When 
every non-failed member of the previous configuration has been notified, the configuration is in- 
stalled. The configuration is upgraded when every configuration with a smaller index has been 
removed at some process in the system. Once a configuration has been upgraded, it is responsible 
for maintaining the data. 

3 Formal Specification of Rambo II 

In this section we present the new algorithm in detail, and discuss how it differs from the RAMBO 
algorithm. The complete implementation, S, is the composition of all the automata described: the
Operation initiated by read_i or write(v)_i

Phase 1:

• Node i communicates with a read-quorum from each configuration in op.cmap in order to determine the
largest value/tag pair.

Phase 2:

• Node i communicates with a write-quorum from each configuration in op.cmap to notify it of the
current largest value/tag pair (or the new value/tag pair, if it is a write operation).

Figure 5: Summary of two phase read or write operation



Joiner_i and Recon_i automata described in Rambo, the new Reader-Writer_i automaton described
here, for all i, and all the channels, with all the actions that are not external actions of the Rambo
II specification hidden.

The key problem that prevents rapid stabilization in the original algorithm is the sequential 
nature of the configuration upgrade mechanism: in RAMBO, configurations are upgraded one at 
a time, in order. (Recall that in Rambo, a configuration is upgraded when every configuration 
with a smaller index has been garbage collected.) Configuration c(k) can be upgraded only if 
configuration c(k — 1) has previously been upgraded. This requirement arises from the need to 
ensure that information is preserved as configurations are changed. As in Rambo, a configuration 
in Rambo II is upgraded when every configuration with a smaller index has been removed at some 
process in the system. Rambo II, however, implements a new reconfiguration protocol that can 
upgrade any configuration, even if configurations with smaller indices have not been upgraded. 
Unlike in Rambo, then, there may be configurations that are not upgraded until they themselves 
are removed, at the same instant that some configuration with a larger index is upgraded. 

After Rambo II completes an upgrade operation for some configuration, all configurations 
with smaller indices can be removed. Thus a single upgrade operation in Rambo II potentially 
has the effect of many garbage collection operations in Rambo, each of which can only remove 
a single configuration. The name has been changed to emphasize the operation's active role in 
configuration management: configuration upgrade is an inherent part of preparing a configuration 
to assume responsibility for the data. The code for the new configuration management mechanism 

Operation initiated by gc(k)_i

Phase 1:

• Node i communicates with a read-quorum from configuration c(k) in order to determine the largest
value/tag pair.

• Node i communicates with a write-quorum from configuration c(k) in order to notify it of configuration
c(k + 1).

Phase 2:

• Node i communicates with a write-quorum from configuration c(k + 1) to notify it of the current largest
value/tag pair.

Figure 6: Summary of two phase garbage-collection operation
Signature:

As in Rambo, with the following modifications:

Internal:
  cfg-upgrade(k)_i, k ∈ N>0
  cfg-upg-query-fix(k)_i, k ∈ N>0
  cfg-upg-prop-fix(k)_i, k ∈ N>0
  cfg-upgrade-ack(k)_i, k ∈ N>0

Configuration Management State:

As in Rambo, with the following replacing the gc record:

  upg, a record with fields:
    phase ∈ {idle, query, prop}, initially idle
    pnum ∈ N
    cmap ∈ CMap
    acc, a finite subset of I
    target ∈ N

Configuration Management Transitions:

(The original figure labels ten lines of this code (A) through (J); the label positions are not recoverable here.)

Internal cfg-upgrade(k)_i
Precondition:
  ¬failed
  status = active
  upg.phase = idle
  cmap(k) ∈ C
  cmap(k − 1) ∈ C
  ∀ℓ ∈ N, ℓ < k : cmap(ℓ) ≠ ⊥
Effect:
  pnum1 ← pnum1 + 1
  upg ← ⟨query, pnum1, cmap, ∅, k⟩

Internal cfg-upg-query-fix(k)_i
Precondition:
  ¬failed
  status = active
  upg.phase = query
  upg.target = k
  ∀ℓ ∈ N, ℓ < k : upg.cmap(ℓ) ∈ C
    ⇒ ∃R ∈ read-quorums(upg.cmap(ℓ)) :
      ∃W ∈ write-quorums(upg.cmap(ℓ)) :
        R ∪ W ⊆ upg.acc
Effect:
  pnum1 ← pnum1 + 1
  upg.pnum ← pnum1
  upg.phase ← prop
  upg.acc ← ∅

Internal cfg-upg-prop-fix(k)_i
Precondition:
  ¬failed
  status = active
  upg.phase = prop
  upg.target = k
  ∃W ∈ write-quorums(upg.cmap(k)) : W ⊆ upg.acc
Effect:
  for ℓ ∈ N : ℓ < k do
    cmap(ℓ) ← ±

Internal cfg-upgrade-ack(k)_i
Precondition:
  ¬failed
  status = active
  upg.target = k
  ∀ℓ ∈ N, ℓ < k : cmap(ℓ) = ±
Effect:
  upg.phase ← idle

Output send(⟨W, v, t, cm, pns, pnr⟩)_{i,j}
Precondition:
  ¬failed
  status = active
  j ∈ world
  ⟨W, v, t, cm, pns, pnr⟩ = ⟨world, value, tag, cmap, pnum1, pnum2(j)⟩
Effect:
  none

Input recv(⟨W, v, t, cm, pns, pnr⟩)_{j,i}
Effect:
  if ¬failed then
    if status ≠ idle then
      status ← active
      world ← world ∪ W
      if t > tag then (value, tag) ← (v, t)
      cmap ← update(cmap, cm)
      pnum2(j) ← max(pnum2(j), pns)
      if op.phase ∈ {query, prop} and pnr ≥ op.pnum then
        op.cmap ← extend(op.cmap, truncate(cm))
        if op.cmap ∈ Truncated then
          op.acc ← op.acc ∪ {j}
        else
          op.acc ← ∅
          op.cmap ← truncate(cmap)
      if upg.phase ∈ {query, prop} and pnr ≥ upg.pnum then
        upg.acc ← upg.acc ∪ {j}

Figure 7: Reader-Writer_i: Configuration Management transitions
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appears in Figure 7. All labeled lines in this section refer to the code therein. 

We now describe in more detail the configuration upgrade operation, which is at the heart of
Rambo II. A configuration upgrade is a two-phase operation, much like the garbage-collection
operation in Rambo. See Figure 8 for a summary of the two phases. An upgrade operation is
initiated at node i with a cfg-upgrade(k) event. When this happens, cmap(k) must be defined, that
is, must be a valid configuration ∈ C (line A). Additionally, for every configuration ℓ < k, cmap(ℓ)
must be either ∈ C or removed, that is, ± (line B).

We refer to configuration c(k) as the target of the upgrade operation, and we refer to the set
of configurations to be removed, {c(ℓ) : ℓ < k ∧ upg.cmap(ℓ) ∈ C}, as the removal-set of the
configuration upgrade operation. The configuration management mechanism guarantees that the
removal-set consists of configurations with a contiguous set of indices.

As a result of the cfg-upgrade event, node i initializes its upg state (line C), and begins the 
query phase of the upgrade operation. In particular, node i stores its current cmap in upg.cmap,
which records the configurations that are currently active. Only these configurations (and, in fact, 
only those with index smaller than k) matter during the operation; new configurations are ignored. 

The query phase continues until node i receives responses from enough nodes. In particular,
for every configuration c(ℓ) with index less than k in upg.cmap, there must exist a read-quorum,
R, of configuration c(ℓ), and a write-quorum, W, of configuration c(ℓ) such that i has received a
response (that is, a recent gossip message) from every node in R ∪ W (lines D-E).

When the query phase completes, a cfg-upg-query-fix event occurs. When this event occurs, 
node i then has the most recent tag and value discovered by operations using configurations with 
index smaller than k. Further, all configurations with indices smaller than k have been notified of 
configuration c(k). Node i then reinitializes upg to begin the propagation phase (lines F-G). 

The propagation phase continues until node i receives responses from a write-quorum in
configuration c(k). In particular, there must exist a write-quorum, W, of configuration c(k), such that
i has received a response from every node in W (line H).

When the propagation phase completes, a cfg-upg-prop-fix event occurs, which verifies the
termination condition. At this point node i has ensured that configuration c(k) has received the most
recent value known to i, which, as a result of the query phase, is itself a recent value. The
configurations with index < k are therefore no longer needed, and node i removes these configurations
from its local cmap, setting cmap(ℓ) = ± for all ℓ < k (lines I-J). Gossip messages may eventually
notify other processes that these configurations have been removed.

Finally, a cfg-upgrade-ack(k) event notifies the client that configuration c(k) has been
successfully upgraded.
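
The upgrade operation just described can be modeled in a few lines of Python. This is only an illustrative sketch of the local state at a single node, under stated assumptions: the names Node, Config, Upg, has_quorum and on_response are hypothetical, message exchange is abstracted into an on_response(j) call that stands for a gossip message from j already accepted as recent, ⊥ is modeled as a missing dictionary key, and the phase check in cfg_upgrade_ack is a guard added for the sketch rather than part of the precondition in Figure 7.

```python
from dataclasses import dataclass, field

REMOVED = "±"  # marker for a removed configuration; ⊥ is a missing key


@dataclass(frozen=True)
class Config:
    name: str
    read_quorums: tuple   # tuple of frozensets of node ids
    write_quorums: tuple  # tuple of frozensets of node ids


@dataclass
class Upg:
    phase: str = "idle"   # idle, query, or prop
    pnum: int = 0
    cmap: dict = field(default_factory=dict)
    acc: set = field(default_factory=set)
    target: int = 0


def has_quorum(quorums, acc):
    """True if some quorum is entirely contained in acc."""
    return any(q <= acc for q in quorums)


class Node:
    def __init__(self, cmap):
        self.cmap = dict(cmap)  # index -> Config or REMOVED
        self.pnum1 = 0
        self.upg = Upg()

    def cfg_upgrade(self, k):
        ok = (k > 0 and self.upg.phase == "idle"
              and isinstance(self.cmap.get(k), Config)       # line A
              and isinstance(self.cmap.get(k - 1), Config)   # footnote 1
              and all(l in self.cmap for l in range(k)))     # line B
        if ok:
            self.pnum1 += 1
            # line C: the upgrade works on a snapshot of the current cmap
            self.upg = Upg("query", self.pnum1, dict(self.cmap), set(), k)
        return ok

    def on_response(self, j):
        # Stands for a recent gossip message from j (hypothetical helper).
        if self.upg.phase != "idle":
            self.upg.acc.add(j)

    def cfg_upg_query_fix(self):
        u = self.upg
        if u.phase != "query":
            return False
        for l in range(u.target):                            # lines D-E
            c = u.cmap[l]
            if isinstance(c, Config) and not (
                    has_quorum(c.read_quorums, u.acc)
                    and has_quorum(c.write_quorums, u.acc)):
                return False
        self.pnum1 += 1                                      # lines F-G
        u.pnum, u.phase, u.acc = self.pnum1, "prop", set()
        return True

    def cfg_upg_prop_fix(self):
        u = self.upg
        if u.phase != "prop":
            return False
        if not has_quorum(u.cmap[u.target].write_quorums, u.acc):  # line H
            return False
        for l in range(u.target):                            # lines I-J
            self.cmap[l] = REMOVED
        return True

    def cfg_upgrade_ack(self):
        u = self.upg
        if u.phase != "prop":  # guard added for this sketch
            return False
        if any(self.cmap.get(l) != REMOVED for l in range(u.target)):
            return False
        u.phase = "idle"
        return True
```

In this model a single call sequence cfg_upgrade(2), cfg_upg_query_fix(), cfg_upg_prop_fix() removes configurations 0 and 1 together, which is the behavior that distinguishes the upgrade from one-at-a-time garbage collection.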

Notice that the algorithm allows a nondeterministic choice of which configuration to upgrade,
and therefore which configurations to remove. It is thus possible to restrict the algorithm
so that it removes only the smallest configuration, upgrading the configurations one at a time. In
this case the algorithm progresses exactly as the original Rambo algorithm does. By restricting the
nondeterminism appropriately, then, Rambo II can be implemented in such a way
as to guarantee performance equivalent to that of Rambo. However, we will show that by allowing greater
flexibility we can achieve equivalent safety results and improved performance.

The new algorithm introduces several difficulties not present in Rambo. Consider, for example,
a nice property guaranteed by the sequential garbage collection algorithm in Rambo: every
configuration is upgraded before it is removed. In Rambo II, on the other hand, some configurations
never receive up-to-date information; a configuration may be upgraded at the same instant it is
removed.

Operation initiated by cfg-upgrade(k)_i:

Phase 1:

• Node i communicates with a read-quorum from each configuration being removed in order to determine
the largest value/tag pair.

• Node i communicates with a write-quorum from each configuration being removed to notify it of the
new, active configuration.

Phase 2:

• Node i communicates with a write-quorum from the target configuration being upgraded, to notify it of
the current largest value/tag pair.

Figure 8: Summary of two phase configuration upgrade operation

1 In the conference version of the paper, this line was omitted. The removal of this line has no detrimental effect
on the algorithm, since the operation then completes in zero time. However, for clarity's sake it is included.

As a result of this fact, a number of plausible improvements fail. Assume that during an
ongoing upgrade operation for configuration c(k) initiated by node i, node i receives a message
indicating that configuration c(k') has been removed, for some k' < k. In Rambo II, node i sets
cmap(k') = ±, but does not change upg.cmap. Consider the following incorrect modification to the
configuration management mechanism. When node i receives such a message, it sets upg.cmap(k')
to ±. Since the configuration has been removed, it seems plausible that the configuration upgrade
operation can safely ignore it, thus completing more quickly. It turns out, however, that this
improvement results in a race condition that can lead to data loss. The configuration upgrade
operation that removes configuration c(k') might occur concurrently with the operation at node
i upgrading configuration c(k). This concurrency might result in data being propagated from
configuration c(k') to a configuration c(k''), with k' < k'' < k, that has already been processed by the
upgrade operation at node i. The data thus propagated might then be lost.

4 Notation and Basic Lemmas 

This section is, to a large extent, a restatement of notation and results from the original Rambo 
paper [13]. Some of the notation in the proofs has been slightly modified to account for the new 
configuration management mechanism, and some of the proofs have therefore been updated, but 
the results are essentially unchanged. Much of this section is taken directly from [13]. 

4.1 Good Executions 

Throughout the rest of this paper, we will talk about "good" executions of the algorithm. In this 
section, we present a set of environment assumptions that define a "good" execution. In general, 
the assumptions we will present require well-formed requests: clients follow the protocol to join and 
to initiate reconfigurations; clients initiate only one operation at a time; clients wait for appropriate 
acknowledgments before proceeding. 

We consider executions of S (recall that S is the entire system combining Reader-Writer, Recon,
and Joiner automata) whose traces satisfy certain assumptions about the environment. We call
these good executions. In particular, an "invariant" is a statement that is true of all states that 
are reachable in good executions of S. The environment assumptions are simple "well-formedness"
conditions:

• Well-formedness for Reader-Writer:

- For every x and i:

* No join(rambo, *)_{x,i}, read_{x,i}, write(*)_{x,i}, or recon(*, *)_{x,i} event is preceded by a fail_i
event.

* At most one join(rambo, *)_{x,i} event occurs.

* Any read_{x,i}, write(*)_{x,i}, or recon(*, *)_{x,i} event is preceded by a join-ack(rambo)_{x,i}
event.

* Any read_{x,i}, write(*)_{x,i}, or recon(*, *)_{x,i} event is preceded by an -ack event for any
preceding event of any of these kinds.

- For every x and c, at most one recon(*, c)_{x,*} event occurs. (This says that configuration
identifiers that are proposed in recon events are unique. It does not say that the
membership and/or quorum sets are unique, just the identifiers. The same membership
and quorum sets may be associated with different configuration identifiers.)
Uniqueness of configuration identifiers is achievable using local process identifiers and sequence
numbers.

- For every c, c', x, and i, if a recon(c, c')_{x,i} event occurs, then it is preceded by:

* A report(c)_{x,i} event, and

* A join-ack(rambo)_{x,j} event for every j ∈ members(c').

• Well-formedness for Recon: 2

- For every i:

* No join(recon)_i or recon(*, *)_i event is preceded by a fail_i event.

* At most one join(recon)_i event occurs.

* Any recon(*, *)_i event is preceded by a join-ack(recon)_i event.

* Any recon(*, *)_i event is preceded by an -ack for any preceding recon(*, *)_i event.

- For every c, at most one recon(*, c)_* event occurs.

- For every c, c', x, and i, if a recon(c, c')_i event occurs, then it is preceded by:

* A report(c)_i event, and

* A join-ack(recon)_j for every j ∈ members(c').
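
As an illustration of how such conditions constrain a trace, the per-node Recon conditions can be checked mechanically. The following is a minimal sketch, not part of the formal model: the function name recon_well_formed and the event-kind strings are chosen for this example, a trace is assumed to be a list of (kind, node) pairs, and only the four per-node conditions are checked (uniqueness of configuration identifiers and the report/join-ack requirements on recon(c, c') are omitted).

```python
def recon_well_formed(trace):
    """Check the per-node Recon conditions on a list of (kind, node) pairs.

    Event kinds (names assumed for this sketch):
    "join", "join-ack", "recon", "recon-ack", "fail".
    """
    failed = set()      # nodes that have had a fail event
    joined = set()      # nodes that have issued join(recon)
    join_acked = set()  # nodes that have received join-ack(recon)
    pending = set()     # nodes with a recon not yet acknowledged
    for kind, node in trace:
        if kind == "fail":
            failed.add(node)
        elif kind == "join":
            # No join after a fail, and at most one join per node.
            if node in failed or node in joined:
                return False
            joined.add(node)
        elif kind == "join-ack":
            join_acked.add(node)
        elif kind == "recon":
            # No recon after a fail, before the join-ack, or while an
            # earlier recon at the same node is still unacknowledged.
            if (node in failed or node not in join_acked
                    or node in pending):
                return False
            pending.add(node)
        elif kind == "recon-ack":
            pending.discard(node)
    return True
```

A client that waits for each acknowledgment before issuing its next request passes the check; a client that issues two recon requests back to back does not.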

4.2 Notational conventions 

In this section, we introduce some definitions and notational conventions, and we add certain history 
variables to the global state of the system S. 
Definitions: 

• update, a binary function on C±, defined by update(c, c') = max(c, c') if c and c' are
comparable (in the augmented partial ordering of C±), update(c, c') = c otherwise.

• extend, a binary function on C±, defined by extend(c, c') = c' if c = ⊥ and c' ∈ C, and
extend(c, c') = c otherwise.

2 The following properties appear in Section 6, but we repeat them here for completeness.

• CMap, the set of configuration maps, defined as the set of mappings from N to C±. The
update and extend operators are extended element-wise to binary operations on CMap.

• truncate, a unary function on CMap, defined by truncate(cm)(k) = ⊥ if there exists ℓ < k
such that cm(ℓ) = ⊥, truncate(cm)(k) = cm(k) otherwise. This truncates configuration map
cm by removing all the configuration identifiers that follow a ⊥.

• Truncated, the subset of CMap such that cm ∈ Truncated if and only if truncate(cm) = cm.

• Usable, the subset of CMap such that cm ∈ Usable iff the pattern occurring in cm consists
of a prefix of finitely many ±s, followed by an element of C, followed by an infinite sequence
of elements of C ∪ {⊥} in which all but finitely many elements are ⊥.
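
The definitions above can be made concrete with a small Python sketch. It rests on assumptions that are not spelled out in this section: a configuration map is represented by a finite list that is implicitly followed by infinitely many ⊥ entries, configurations are plain strings, and the augmented partial ordering on C± is taken to be ⊥ below every configuration and every configuration below ±, with distinct configurations incomparable. All function names are chosen for the sketch.

```python
BOT = "⊥"  # undefined entry
REM = "±"  # removed entry


def update_entry(c, c2):
    """max(c, c2) when comparable (assumed order: ⊥ < config < ±), else c."""
    if c == c2:
        return c
    if c == BOT or c2 == REM:
        return c2
    # Either c2 is ⊥, or c is ±, or the two are distinct configurations
    # (incomparable); in every one of these cases the result is c.
    return c


def extend_entry(c, c2):
    """Fill an undefined entry with a configuration; otherwise keep c."""
    return c2 if c == BOT and c2 not in (BOT, REM) else c


def _pad(cm, n):
    # A finite list stands for a map whose remaining entries are all ⊥.
    return list(cm) + [BOT] * (n - len(cm))


def update(cm, cm2):
    n = max(len(cm), len(cm2))
    return [update_entry(a, b) for a, b in zip(_pad(cm, n), _pad(cm2, n))]


def extend(cm, cm2):
    n = max(len(cm), len(cm2))
    return [extend_entry(a, b) for a, b in zip(_pad(cm, n), _pad(cm2, n))]


def truncate(cm):
    """Replace every entry at or after the first ⊥ by ⊥."""
    out, seen_bot = [], False
    for c in cm:
        seen_bot = seen_bot or c == BOT
        out.append(BOT if seen_bot else c)
    return out


def is_truncated(cm):
    return truncate(cm) == list(cm)


def is_usable(cm):
    """Prefix of ±, then a configuration, then only configurations or ⊥."""
    i = 0
    while i < len(cm) and cm[i] == REM:
        i += 1
    if i == len(cm) or cm[i] == BOT:
        return False
    return all(c != REM for c in cm[i + 1:])
```

For example, the map ±, c1, ⊥, c3 is usable but not truncated; truncating it yields ±, c1, ⊥, ⊥, which is both.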

An operation is a pair (n, i) consisting of a natural number n and an index i ∈ I. Here, i is the
index of the process running the operation, and n is the value of pnum1_i just after the read, write,
or cfg-upgrade event of the operation occurs.

We introduce the following history variables:

• in-transit, a set of messages, initially ∅.

A message is added to the set when it is sent by any Reader-Writer_i to any Reader-Writer_j.
No message is ever removed from this set.

• For every k ∈ N:

1. c(k) ∈ C, initially undefined.

This is set when the first new-config(c, k)_i occurs, for some c and i. It is set to the c that
appears as the first argument of this action.

• For every operation π:

1. tag(π) ∈ T, initially undefined.

This is set to the value of tag at the process running π, at the point right after π's query-fix
or cfg-upg-query-fix event occurs. If π is a read or configuration upgrade operation, this
is the highest tag that it encounters during the query phase. If π is a write operation,
this is the new tag that is selected for performing the write.

• For every read or write operation π:

1. query-cmap(π), a CMap, initially undefined.

This is set in the query-fix step of π, to the value of op.cmap in the pre-state.

2. R(π, k), for k ∈ N, a subset of I, initially undefined.

This is set in the query-fix step of π, for each k such that query-cmap(π)(k) ∈ C. It is
set to an arbitrary R ∈ read-quorums(c(k)) such that R ⊆ op.acc in the pre-state.

3. prop-cmap(π), a CMap, initially undefined.

This is set in the prop-fix step of π, to the value of op.cmap in the pre-state.

4. W(π, k), for k ∈ N, a subset of I, initially undefined.

This is set in the prop-fix step of π, for each k such that prop-cmap(π)(k) ∈ C. It is set
to an arbitrary W ∈ write-quorums(c(k)) such that W ⊆ op.acc in the pre-state.

• For every configuration upgrade operation γ for k:

1. removal-set(γ), a subset of N, initially undefined.

This is set in the cfg-upgrade step of γ, to the set {ℓ : ℓ < k, cmap(ℓ) ≠ ±}.

2. R(γ, ℓ), for ℓ ∈ N, a subset of I, initially undefined.

This is set in the cfg-upg-query-fix step of γ, for each ℓ ∈ removal-set(γ), to an arbitrary
R ∈ read-quorums(c(ℓ)) such that R ⊆ upg.acc in the pre-state.

3. W_1(γ, ℓ), for ℓ ∈ N, a subset of I, initially undefined.

This is set in the cfg-upg-query-fix step of γ, for each ℓ ∈ removal-set(γ), to an arbitrary
W ∈ write-quorums(c(ℓ)) such that W ⊆ upg.acc in the pre-state.

4. W_2(γ), a subset of I, initially undefined.

This is set in the cfg-upg-prop-fix step of γ, to an arbitrary W ∈ write-quorums(c(k))
such that W ⊆ upg.acc in the pre-state.

In any good execution α, we define the following events (more precisely, we are giving additional
names to some existing events):

1. For every read or write operation π:

(a) query-phase-start(π), initially undefined.

This is defined in the query-fix step of π, to be the unique earlier event at which the
collection of query results was started and not subsequently restarted. This is either a
read, write, or recv event.

(b) prop-phase-start(π), initially undefined.

This is defined in the prop-fix step of π, to be the unique earlier event at which the
collection of propagation results was started and not subsequently restarted. This is
either a query-fix or recv event.

4.3 Configuration map invariants 

In this section, we give invariants describing the kinds of configuration maps that may appear in 
various places in the state of S. We begin with a lemma saying that various operations yield or 
preserve the "usable" property: 

Lemma 4.1

1. If cm, cm' ∈ Usable then update(cm, cm') ∈ Usable.

2. If cm ∈ Usable, k ∈ N, c ∈ C, and cm' is identical to cm, except that cm'(k) = update(cm(k), c),
then cm' ∈ Usable.

3. If cm, cm' ∈ Usable then extend(cm, cm') ∈ Usable.

4. If cm ∈ Usable then truncate(cm) ∈ Usable.

Proof. Part 1 is shown using a case analysis based on which of cm and cm' has a longer prefix
of ±s. Part 2 uses a case analysis based on where k is with respect to the prefix of ±s. Part 3 and
Part 4 are also straightforward. □

The next invariant (recall from Section 4.1 that this means a property of all states that arise
in good executions of S) describes some properties of cmap_i that hold while Reader-Writer_i is
conducting a configuration upgrade operation:

Invariant 4.2 If upg.phase_i ≠ idle and upg.target_i = k, then:

1. ∀ℓ : ℓ ≤ k ⇒ cmap(ℓ)_i ∈ C ∪ {±}.

2. If k_1 = min{ℓ : ℓ < k and upg.cmap(ℓ) ≠ ±} then k_1 = 0 or cmap(k_1 − 1)_i = ±.

Proof. By the precondition of cfg-upgrade(k)_i and monotonicity of all the changes to cmap_i. □

We next proceed to describe the patterns of C, ⊥, and ± values that may occur in configuration
maps in various places in the system state.

Invariant 4.3 Let cm be a CMap that appears as one of the following:

1. The cm component of some message in in-transit.

2. cmap_i for any i ∈ I.

3. op.cmap_i for some i ∈ I for which op.phase_i ≠ idle.

4. query-cmap(π) or prop-cmap(π) for any operation π.

5. upg.cmap_i for some i ∈ I for which upg.phase_i ≠ idle.

Then cm ∈ Usable.

In the following proof and elsewhere, we use dot notation to indicate components of a state, for
example, s.cmap_i indicates the value of cmap_i in state s.

Proof. By induction on the length of a finite good execution.

Base: Part 1 holds because initially, in-transit is empty. Part 2 holds because initially, for
every i, cmap(0)_i = c_0 and cmap(k)_i = ⊥ for all k ≥ 1; the resulting CMap is in Usable. Part 3 and Part 5
hold vacuously, because in the initial state, all op.phase and upg.phase values are idle. Part 4 also
holds vacuously, because in the initial state, all query-cmap and prop-cmap variables are undefined.

Inductive step: Let s and s' be the states before and after the new event, respectively. We consider
Parts 1-5 one by one.

For Part 1, the interesting case is a send_i event that puts a message containing cm in in-transit.
The precondition on the send action implies that cm is set to s.cmap_i. The inductive hypothesis,
Part 2, implies that s.cmap_i ∈ Usable, which suffices.

For Part 2, fix i. The interesting cases are those that may change cmap_i, namely, new-config_i, recv_i
for a gossip (non-join) message, and cfg-upg-prop-fix_i. The last case is the only one modified from
the original Rambo algorithm.

1. new-config(c, *)_i.

By inductive hypothesis, s.cmap_i ∈ Usable. The only change this can make is changing a ⊥
to c. Then Lemma 4.1, Part 2, implies that s'.cmap_i ∈ Usable.

2. recv(⟨*, *, cm, *, *⟩)_{*,i}.

By inductive hypothesis, cm ∈ Usable and s.cmap_i ∈ Usable. The step sets s'.cmap_i to
update(s.cmap_i, cm). Lemma 4.1, Part 1, then implies that s'.cmap_i ∈ Usable.

3. cfg-upg-prop-fix(k)_i.

This sets cmap(ℓ)_i to ± for all ℓ < k. By the definition of this step, s'.cmap(ℓ)_i = ± for
ℓ < k.

If s.cmap(k − 1)_i = ±, then the operation has no effect, and s'.cmap_i = s.cmap_i ∈ Usable.
Assume, then, that s.cmap(k − 1)_i ∈ C ∪ {⊥}. This implies, by the inductive hypothesis
showing s.cmap_i ∈ Usable, that s.cmap(ℓ)_i ∈ C ∪ {⊥} for all ℓ ≥ k − 1. By Invariant 4.2, we
know that s.cmap(k)_i ∈ C ∪ {±}, and therefore s.cmap(k)_i ∈ C. Therefore s'.cmap(k)_i ∈ C
and s'.cmap(ℓ)_i ∈ C ∪ {⊥} for all ℓ > k, since the cfg-upg-prop-fix does not change entries
in the cmap larger than k − 1. Further, there are only finitely many entries in s.cmap_i that
are in C (by the inductive hypothesis), and so there are still only finitely many such entries in
s'.cmap_i. Therefore, s'.cmap_i ∈ Usable.

For Part 3, the interesting actions to consider are those that modify op.cmap, namely, read_i, write_i,
recv_i, and query-fix_i.

1. read_i, write_i, or query-fix_i.

By inductive hypothesis, s.cmap_i ∈ Usable. The new step sets s'.op.cmap_i to truncate(s.cmap_i);
since s.cmap_i ∈ Usable, Lemma 4.1, Part 4, implies that this is also usable.

2. recv(⟨*, *, cm, *, *⟩)_{*,i}.

This step may alter op.cmap_i only if s.op.phase ∈ {query, prop}, and then in only two ways:
by setting it either to extend(s.op.cmap_i, truncate(cm)) or to truncate(update(s.cmap_i, cm)).
The inductive hypothesis implies that s.op.cmap_i, s.cmap_i, and cm are all in Usable. Lemma 4.1
implies that truncate, extend, and update all preserve usability. Therefore, s'.op.cmap_i ∈
Usable.

For Part 4, the actions to consider are query-fix_i and prop-fix_i.

1. query-fix_i.

This sets s'.query-cmap_i to the value of s.op.cmap_i. Since by inductive hypothesis the latter
is usable, so is s'.query-cmap_i.

2. prop-fix_i.

This sets s'.prop-cmap_i to the value of s.op.cmap_i. Since by inductive hypothesis the latter
is usable, so is s'.prop-cmap_i.

For Part 5, the actions to consider are cfg-upgrade(k)_i and cfg-upg-query-fix(k)_i. These set s'.upg.cmap_i
to the value of s.cmap_i. Since by the inductive hypothesis the latter is usable, so is s'.upg.cmap_i.

□

We now strengthen Invariant 4.3 to say more about the form of the CMaps that are used for 
read and write operations: 

Invariant 4.4 Let cm be a CMap that appears as op.cmap_i for some i ∈ I for which op.phase_i ≠
idle, or as query-cmap(π) or prop-cmap(π) for any operation π. Then:

1. cm ∈ Truncated.

2. cm consists of finitely many ± entries followed by finitely many C entries followed by an
infinite number of ⊥ entries.

Proof. We prove that the desired properties hold for a cm that is op.cmap_i. The same properties
for query-cmap_i and prop-cmap_i follow by the way they are defined, from op.cmap_i.

To prove Part 1 we proceed by induction. In the initial state, op.phase_i = idle, which makes
the claim vacuously true. For the inductive step we consider all actions that alter op.cmap_i:

1. read_i, write_i, or query-fix_i.

These set op.cmap_i to truncate(cmap_i), which is necessarily in Truncated.

2. recv_i.

This first sets op.cmap_i to a preliminary value and then tests if the result is in Truncated.
If it is, we are done. If not, then this step resets op.cmap_i to truncate(cmap_i), which is in
Truncated.

To see Part 2, note that cm ∈ Usable by Invariant 4.3 and cm ∈ Truncated by Part 1. The claimed
pattern then follows from the definitions of Usable and Truncated. □

4.4 Phase guarantees 

In this section, we present results saying what is achieved by the individual operation phases. We 
give four lemmas, describing the messages that must be sent and received and the information flow 
that must occur during the two phases of configuration-upgrades and during the two phases of read 
and write operations. 

Note that these lemmas treat the case where j = i uniformly with the case where j ≠ i. This
is because, in the Reader-Writer algorithm, communication from a location to itself is treated
uniformly with communication between two different locations. We first consider the query phase 
of a configuration-upgrade: 

Lemma 4.5 Suppose that a cfg-upg-query-fix(k)_i event for configuration upgrade operation γ occurs
in α and k' ∈ removal-set(γ). Suppose j ∈ R(γ, k') ∪ W_1(γ, k').
Then there exist messages m from i to j and m' from j to i such that:

1. m is sent after the cfg-upgrade(k)_i event of γ.

2. m' is sent after j receives m.

3. m' is received before the cfg-upg-query-fix(k)_i event of γ.

4. In any state after j receives m, cmap(ℓ)_j ≠ ⊥ for all ℓ ≤ k.

5. tag(γ) ≥ t, where t is the value of tag_j in any state before j sends message m'.

Proof. The phase number discipline implies the existence of the claimed messages m and m'.

For Part 4, the precondition of cfg-upgrade(k) implies that, when the cfg-upgrade(k)_i event of
γ occurs, cmap(ℓ)_i ≠ ⊥ for all ℓ ≤ k. Therefore, j sets cmap(ℓ)_j ≠ ⊥ for all ℓ ≤ k when it receives
m. Monotonicity of cmap_j ensures that this property persists forever.

For Part 5, let t be the value of tag_j in any state before j sends message m'. Let t' be the value
of tag_j in the state just before j sends m'. Then t ≤ t', by monotonicity. The tag component of
m' is equal to t', by the code for send. Since i receives this message before the cfg-upg-query-fix(k),
it follows that tag(γ) is set by i to a value ≥ t. □
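
The phase number discipline invoked in this proof can be illustrated with a small Python model. This is a sketch under assumptions: messages are reduced to a (sender, pns, pnr) triple, the class and method names are hypothetical, and a response is taken to count toward the current phase when its pnr field is at least the phase number recorded at the start of that phase.

```python
class PhaseNode:
    """Minimal model of the pnum1 / pnum2 bookkeeping at one node."""

    def __init__(self, name):
        self.name = name
        self.pnum1 = 0       # local phase counter
        self.pnum2 = {}      # highest phase number heard from each node
        self.phase = "idle"
        self.upg_pnum = 0    # phase number of the current phase
        self.acc = set()     # nodes whose responses count for this phase

    def start_phase(self):
        self.pnum1 += 1
        self.upg_pnum = self.pnum1
        self.phase = "query"
        self.acc = set()

    def send(self, j):
        # pns is our own counter; pnr echoes what we last heard from j.
        return (self.name, self.pnum1, self.pnum2.get(j, 0))

    def recv(self, msg):
        sender, pns, pnr = msg
        self.pnum2[sender] = max(self.pnum2.get(sender, 0), pns)
        # Count the sender only if it has seen a message that we sent
        # after the current phase started (assumed comparison: >=).
        if self.phase != "idle" and pnr >= self.upg_pnum:
            self.acc.add(sender)
```

A message that j sent before hearing from i in the current phase carries a stale pnr and is ignored, so any j in acc must have received some m sent after the phase began and replied with some m' afterwards. These are the messages m and m' whose existence the lemma asserts.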

Next, we consider the propagation phase of a configuration upgrade:

Lemma 4.6 Suppose that a cfg-upg-prop-fix(k)_i event for a configuration upgrade operation γ
occurs in α. Suppose that j ∈ W_2(γ).
Then there exist messages m from i to j and m' from j to i such that:

1. m is sent after the cfg-upg-query-fix(k)_i event of γ.

2. m' is sent after j receives m.

3. m' is received before the cfg-upg-prop-fix(k)_i event of γ.

4. In any state after j receives m, tag_j ≥ tag(γ).

Proof. The phase number discipline implies the existence of the claimed messages m and m'.

For Part 4, when j receives m, it sets tag_j to be ≥ tag(γ). Monotonicity of tag_j ensures that
this property persists in later states. □

Next, we consider the query phase of read and write operations:

Lemma 4.7 Suppose that a query-fix_i event for a read or write operation π occurs in α. Let
k, k' ∈ N. Suppose query-cmap(π)(k) ∈ C and j ∈ R(π, k).
Then there exist messages m from i to j and m' from j to i such that:

1. m is sent after the query-phase-start(π) event.

2. m' is sent after j receives m.

3. m' is received before the query-fix event of π.

4. If t is the value of tag_j in any state before j sends m', then:

(a) tag(π) ≥ t.

(b) If π is a write operation then tag(π) > t.

5. If cmap(ℓ)_j ≠ ⊥ for all ℓ ≤ k' in any state before j sends m', then query-cmap(π)(ℓ) ∈ C for
some ℓ ≥ k'.

Proof. The phase number discipline implies the existence of the claimed messages m and m'.

For Part 4, the tag component of message m' is ≥ t, so i receives a tag that is ≥ t during the
query phase of π. Therefore, tag(π) ≥ t. Also, if π is a write, the effects of the query-fix imply that
tag(π) > t.

Finally, we show Part 5. In the cm component of message m', cm(ℓ) ≠ ⊥ for all ℓ ≤ k'.
Therefore, truncate(cm)(ℓ) = cm(ℓ) for all ℓ ≤ k', so truncate(cm)(ℓ) ≠ ⊥ for all ℓ ≤ k'.

Let cm' be the configuration map extend(op.cmap_i, truncate(cm)) computed by i during the
effects of the recv event for m'. Since i does not reset op.acc to ∅ in this step, by definition of the
query-phase-start event, it follows that cm' ∈ Truncated, and cm' is the value of op.cmap_i just after
the recv step.

Fix ℓ, 0 ≤ ℓ ≤ k'. We claim that cm'(ℓ) ≠ ⊥. We consider cases:

1. op.cmap(ℓ)_i ≠ ⊥ just before the recv step.

Then the definition of extend implies that cm'(ℓ) ≠ ⊥, as needed.

2. op.cmap(ℓ)_i = ⊥ just before the recv step and truncate(cm)(ℓ) ∈ C.

Then the definition of extend implies that cm'(ℓ) ∈ C, which implies that cm'(ℓ) ≠ ⊥, as
needed.

3. op.cmap(ℓ)_i = ⊥ just before the recv step and truncate(cm)(ℓ) ∉ C.

Since truncate(cm)(ℓ) ≠ ⊥, it follows that truncate(cm)(ℓ) = ±. Since truncate(cm)(ℓ) = ±
and truncate(cm) ∈ Usable, it follows that, for some ℓ' > ℓ, truncate(cm)(ℓ') ∈ C.

By the case assumption, op.cmap(ℓ)_i = ⊥ just before the recv step. Since, by Invariant 4.4,
op.cmap_i ∈ Truncated, it follows that op.cmap(ℓ')_i = ⊥ before the recv step.

Then by definition of extend, we have that cm'(ℓ) = ⊥ while cm'(ℓ') ∈ C. This implies that
cm' ∉ Truncated, which contradicts the fact, already shown, that cm' ∈ Truncated. So this
case cannot arise.

Since this argument holds for all ℓ, 0 ≤ ℓ ≤ k', it follows that cm'(ℓ) ≠ ⊥ for all ℓ ≤ k'. Since
cm'(ℓ) ≠ ⊥ for all ℓ ≤ k', Invariant 4.3 implies that cm' ∈ Usable, which implies by definition of
Usable that cm'(ℓ) ∈ C for some ℓ ≥ k'. That is, op.cmap(ℓ)_i ∈ C for some ℓ ≥ k' immediately
after the recv step. This implies that query-cmap(π)(ℓ) ∈ C for some ℓ ≥ k', as needed. □

And finally, we consider the propagation phase of read and write operations:

Lemma 4.8 Suppose that a prop-fix_i event for a read or write operation π occurs in α. Suppose
prop-cmap(π)(k) ∈ C and j ∈ W(π, k).
Then there exist messages m from i to j and m' from j to i such that:

1. m is sent after the prop-phase-start(π) event.

2. m' is sent after j receives m.

3. m' is received before the prop-fix event of π.

4. In any state after j receives m, tag_j ≥ tag(π).

5. If cmap(ℓ)_j ≠ ⊥ for all ℓ ≤ k' in any state before j sends m', then prop-cmap(π)(ℓ) ∈ C for
some ℓ ≥ k'.

Proof. The phase number discipline implies the existence of the claimed messages m and m'.

For Part 4, let m.tag be the tag field of message m. Since m is sent after the prop-phase-start
event, which is not earlier than the query-fix, it must be that m.tag ≥ tag(π). Therefore, by the
effects of the recv, just after j receives m, tag_j ≥ m.tag ≥ tag(π). Then monotonicity of tag_j
implies that tag_j ≥ tag(π) in any state after j receives m.

For Part 5, the proof is analogous to the proof of Part 5 of Lemma 4.7. In fact, it is identical
except for the final conclusion, which now says that prop-cmap(π)(ℓ) ∈ C for some ℓ ≥ k'. □

5 Atomic Consistency 

This section contains the proof of atomic consistency. The proof is carried out in several stages. 
First in Section 5.1 we present some lemmas about the new configuration management mechanism, 
describing the relationship between configuration upgrade operations. Section 5.2 describes the 
relationship between read/write operations and configuration upgrade operations. Section 5.3 then
considers two read or write operations, and culminates in Lemma 5.11, which says that tags are
monotonic with respect to non-concurrent read or write operations. Finally, Section 5.4 uses the 
tags to define a partial order on operations and verifies the four properties required for atomicity. 

5.1 Behavior of configuration upgrade 

This section presents the key new technical lemmas on which the proof of atomicity is based. Specif- 
ically, we present lemmas describing information flow between configuration upgrade operations. 
These lemmas assert the existence of a sequence of configuration upgrade operations on which we 
can make certain necessary guarantees. In particular, the key property is that the tags are mono- 
tonically increasing with respect to the specific sequence of upgrade operations, guaranteeing that 
value/tag information is propagated to newer configurations. 

The first lemma shows that if all configuration upgrade operations remove two particular
configurations together, then those two configurations are always in the same state in all cmaps.

Lemma 5.1 Suppose that k > 0, and a is an execution in which no cfg-upg-prop-fix(A;) event occurs 
in a. Suppose that cm is a CMap that appears as one of the following in any state in a: 

1. The cm component of some message in in-transit. 

2. cmap i for any i £ /. 

// cm(k — 1) = ± then cm(k) = ±. 

Proof. Fix some a and k > such that no cfg-upg-prop-fix(A;) event occurs in a. We pro- 
ceed by induction on the length of a finite prefix of a: for every action in a, if before the action 
cm(k — 1) = ± => cm(k) = ±, then the same implication holds after the action. 

Base: For Part 1, the conclusion follows vacuously because initially in-transit is empty. For Part 
2, the conclusion again follows vacuously because initially cmap^l) / ± for all i and I. 

Inductive step: Let s and s' be the states before and after the new event, respectively. We consider
Parts 1 and 2 separately.

For Part 1, the interesting case is a send_i event that puts a message containing cm in in-transit.
The precondition on the send action implies that cm is set to s.cmap_i. The inductive hypothesis,
Part 2, implies that if s.cmap(k − 1) = ±, then s.cmap(k) = ±. Therefore in state s', the same
holds for cm, which has been added to in-transit.

For Part 2, fix i. The interesting cases are those that may change cmap_i, namely, new-config_i, recv_i
for a gossip message, and cfg-upg-prop-fix_i.

1. new-config(c, *)_i.

If s'.cmap(k − 1)_i = ±, then s.cmap(k − 1)_i = ±, since installing a new configuration does
not set any entry to ±. Then by the inductive hypothesis s.cmap(k)_i = ±, which implies
that s'.cmap(k)_i = ±, since this action cannot modify an entry that is already ±.

2. recv((*, *, cm, *, *))_i.

First, if cm(0) ≠ ±, then the message does not cause any entry in s.cmap to be set to ±,
and as in Case 1 the desired property still holds. Also, if s.cmap(0) ≠ ±, then for all ℓ,
s'.cmap(ℓ) = ± if and only if cm(ℓ) = ±. By the inductive hypothesis cm(k − 1) = ± ⇒
cm(k) = ±, so the desired conclusion follows. For the rest of this case, we will assume that
cm(0) = ± and s.cmap(0) = ±.

By Invariant 4.3, cm ∈ Usable. Therefore we can define k_msg.max such that cm(ℓ) = ± for
all ℓ ≤ k_msg.max and cm(ℓ) ≠ ± for all ℓ > k_msg.max. Similarly, we can define k_max such that
s.cmap(ℓ)_i = ± for all ℓ ≤ k_max and s.cmap(ℓ)_i ≠ ± for all ℓ > k_max. Define k'_max in the
same way for the poststate, s'.

There are two cases. First, assume k_max ≥ k_msg.max. Then k'_max = k_max, by the monotonicity
of CMap. By our inductive hypothesis s.cmap(k − 1) = ± ⇒ s.cmap(k) = ±; it follows,
then, that if k − 1 ≤ k_max then k ≤ k_max. Therefore if k − 1 ≤ k'_max, then k ≤ k'_max. Finally,
then, if s'.cmap(k − 1) = ±, then s'.cmap(k) = ±.

Assume, then, that k_msg.max > k_max. Then after the update operation, k'_max = k_msg.max.
By our inductive hypothesis, cm(k − 1) = ± ⇒ cm(k) = ±; it follows, then, that if
k − 1 ≤ k_msg.max, then k ≤ k_msg.max. Therefore if k − 1 ≤ k'_max, then k ≤ k'_max. Finally, then,
s'.cmap(k − 1) = ± implies that s'.cmap(k) = ±.

3. cfg-upg-prop-fix(k')_i.

By assumption, k ≠ k'. If k < k', then this operation sets both s'.cmap(k − 1)_i = ± and
s'.cmap(k)_i = ±. If k > k', then this operation has no effect on cmap(k)_i or cmap(k − 1)_i,
and the desired property still holds.

□ 
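To make the recv case of the proof concrete, the following is a minimal Python sketch. It assumes (this is an illustrative model, not the paper's code) that a CMap is a list whose entries are ordered ⊥ < configuration identifier < ±, and that receiving a gossip message merges the two maps pointwise by taking the larger entry. The names `BOT`, `PM`, `update`, and `implication_holds` are hypothetical.

```python
BOT = "⊥"  # entry not yet known (assumed encoding)
PM = "±"   # entry already removed (assumed encoding)

def rank(entry):
    """Assumed information order on CMap entries: ⊥ < configuration < ±."""
    if entry == BOT:
        return 0
    if entry == PM:
        return 2
    return 1

def update(cm, other):
    """Pointwise merge of two CMaps, keeping the more advanced entry."""
    n = max(len(cm), len(other))
    a = list(cm) + [BOT] * (n - len(cm))
    b = list(other) + [BOT] * (n - len(other))
    return [x if rank(x) >= rank(y) else y for x, y in zip(a, b)]

def implication_holds(cm, k):
    """The property of Lemma 5.1: cm(k - 1) = ± implies cm(k) = ±."""
    return cm[k - 1] != PM or cm[k] == PM

# Hypothetical run in which upgrades only ever have targets 2 and 4,
# so no cfg-upg-prop-fix(1) or cfg-upg-prop-fix(3) event occurs.
local = [PM, PM, "c2", BOT, BOT]
msg = [PM, PM, PM, PM, "c4"]
merged = update(local, msg)
```

Because the merge only moves entries upward, the largest removed index after the merge is the larger of the two prefixes, which is the case split on k_max and k_msg.max used in the proof.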

The following corollary says that if a cfg-upgrade(k) event for an upgrade operation γ occurs in
an execution, then there is some previous configuration upgrade operation γ' (that completes before
the upgrade event) where the target of γ' is the configuration with the smallest index removed by γ.

Corollary 5.2 Let γ be a configuration upgrade operation, initiated by a cfg-upgrade(k)_i event in α,
and let k_1 = min{removal-set(γ)}. That is, k_1 is the smallest element such that upg-cmap(γ)(k_1) ∈
C. Assume k_1 > 0. Then a cfg-upg-prop-fix(k_1)_j event for some configuration upgrade operation γ'
occurs in α for some j such that the cfg-upg-prop-fix event of γ' precedes the cfg-upgrade(k)_i event
in α.

Proof. By the definition of k_1, we know that in the state just after the cfg-upgrade event,
upg.cmap(k_1 − 1)_i = ± and upg.cmap(k_1)_i ≠ ±. Since upg.cmap_i is set by the cfg-upgrade event
to cmap_i in the state just prior to the cfg-upgrade event, we know that cmap(k_1 − 1)_i = ± and
cmap(k_1)_i ≠ ± in the state just prior to the cfg-upgrade event. Lemma 5.1, then, implies that some
cfg-upg-prop-fix(k_1) event for some operation γ' occurs in α preceding the cfg-upgrade event.

□ 

The next lemma says that for a given configuration upgrade operation γ, there exists a sequence
of preceding upgrade operations satisfying certain properties. The lemma begins by assuming 
that some configuration with index k is removed by the specified upgrade operation. For every 
configuration with an index smaller than k, we choose a single upgrade operation - that removes 
that configuration - to add to the sequence. Therefore the constructed sequence may well contain 
the same configuration upgrade operation multiple times, if the operation has removed multiple 
configurations. If two elements in the sequence are distinct upgrade operations, then the earlier
operation in the sequence completes before the later operation in the sequence is initiated. Also, the
target of an upgrade operation in the sequence is removed by the next distinct upgrade operation in 
the sequence. As a result of these properties, the configuration upgrade process obeys a sequential 
discipline. 

Lemma 5.3 If a cfg-upgrade_i event for upgrade operation γ occurs in α such that k ∈ removal-set(γ),
then there exists a sequence (possibly containing repeated elements) of configuration upgrade
operations γ_0, γ_1, ..., γ_k with the following properties:

1. ∀ s : 0 ≤ s ≤ k, s ∈ removal-set(γ_s),

2. ∀ s : 0 ≤ s < k, if γ_s ≠ γ_{s+1}, then the cfg-upg-prop-fix event of γ_s occurs in α and
the cfg-upgrade event of γ_{s+1} occurs in α, and the cfg-upg-prop-fix event of γ_s precedes the
cfg-upgrade event of γ_{s+1}, and

3. ∀ s : 0 ≤ s < k, if γ_s ≠ γ_{s+1}, then target(γ_s) ∈ removal-set(γ_{s+1}).

Proof. We construct the sequence in reverse order, first defining γ_k, and then at each step defining
the preceding element. We prove the lemma by backward induction on ℓ, for ℓ = k down to ℓ = 0,
maintaining the following three properties at each step of the induction:

1'. ∀ s : ℓ ≤ s ≤ k, s ∈ removal-set(γ_s),

2'. ∀ s : ℓ ≤ s < k, if γ_s ≠ γ_{s+1}, then the cfg-upg-prop-fix event of γ_s occurs in α and the
cfg-upgrade event of γ_{s+1} occurs in α, and the cfg-upg-prop-fix event of γ_s precedes the
cfg-upgrade event of γ_{s+1}, and

3'. ∀ s : ℓ ≤ s < k, if γ_s ≠ γ_{s+1}, then target(γ_s) ∈ removal-set(γ_{s+1}).

To begin the induction, we first examine the base case, where ℓ = k. Define γ_k = γ. Property 1'
holds by assumption, and Property 2' and Property 3' are vacuously true.

For the inductive step, we assume that γ_ℓ has been defined and that properties 1'-3' hold.
If ℓ = 0, then γ_0 has been defined, and we are done. Otherwise, we need to define γ_{ℓ−1}. If
ℓ − 1 ∈ removal-set(γ_ℓ), then let γ_{ℓ−1} = γ_ℓ, and all the properties still hold.

Otherwise, ℓ − 1 ∉ removal-set(γ_ℓ) and ℓ ∈ removal-set(γ_ℓ), which implies that ℓ = min{removal-set(γ_ℓ)}
because each configuration upgrade operates on a consecutive sequence of configurations. Then by
Corollary 5.2, there occurs in α a configuration upgrade operation, which we label γ_{ℓ−1}, with the
following properties: (i) the cfg-upg-prop-fix event of γ_{ℓ−1} precedes the cfg-upgrade event of γ_ℓ, and
(ii) target(γ_{ℓ−1}) = min{k' : k' ∈ removal-set(γ_ℓ)}.

Recall that ℓ = min{removal-set(γ_ℓ)}. Therefore, by Property (ii) of γ_{ℓ−1}, target(γ_{ℓ−1}) = ℓ.
Since removal-set(γ_{ℓ−1}) ≠ ∅, this implies that ℓ − 1 ∈ removal-set(γ_{ℓ−1}), proving Property 1'.
Property 2' follows from Property (i) of γ_{ℓ−1}. Property 3' follows from Property (ii) of γ_{ℓ−1}. □
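The backward construction in this proof can be sketched in a few lines of Python. The sketch assumes each upgrade removes a consecutive range of indices ending just below its target, as the proof states; the `Upgrade` class, its field names, and `build_chain` are hypothetical, and the timing condition of Property 2 is not modeled (the list `completed` simply stands for upgrades known to have finished earlier).

```python
from dataclasses import dataclass
from typing import List

@dataclass(frozen=True)
class Upgrade:
    name: str
    lo: int      # smallest index removed by this upgrade
    target: int  # removes indices lo .. target - 1

    def removes(self, idx: int) -> bool:
        return self.lo <= idx < self.target

def build_chain(gamma: Upgrade, k: int, completed: List[Upgrade]) -> List[Upgrade]:
    """Return [γ_0, ..., γ_k] with s in removal-set(γ_s), built from ℓ = k down to 0."""
    assert gamma.removes(k)
    chain = {k: gamma}
    for l in range(k, 0, -1):
        cur = chain[l]
        if cur.removes(l - 1):
            # Same operation also removes ℓ - 1, so repeat it in the sequence.
            chain[l - 1] = cur
        else:
            # ℓ is the smallest index removed by cur, so (as in Corollary 5.2)
            # look for an earlier upgrade whose target is ℓ.
            chain[l - 1] = next(u for u in completed if u.target == l)
    return [chain[s] for s in range(k + 1)]

# Hypothetical history: a removes {0, 1}, b removes {2}, g removes {3, 4, 5}.
a, b, g = Upgrade("a", 0, 2), Upgrade("b", 2, 3), Upgrade("g", 3, 6)
chain = build_chain(g, 4, [a, b])
```

In this example the resulting sequence repeats `a` and `g`, which illustrates why the lemma allows repeated elements.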

The sequential nature of configuration upgrade has a nice consequence for propagation of tags:
for any sequence of upgrade operations like that in Lemma 5.3, tag(γ_s) is nondecreasing in s.

Lemma 5.4 Let γ_0, ..., γ_k be a sequence of configuration upgrade operations such that:

1. ∀ s : 0 ≤ s ≤ k, s ∈ removal-set(γ_s),

2. ∀ s : 0 ≤ s < k, if γ_s ≠ γ_{s+1}, then the cfg-upg-prop-fix event of γ_s occurs in α and
the cfg-upgrade event of γ_{s+1} occurs in α, and the cfg-upg-prop-fix event of γ_s precedes the
cfg-upgrade event of γ_{s+1}, and

3. ∀ s : 0 ≤ s < k, if γ_s ≠ γ_{s+1}, then target(γ_s) ∈ removal-set(γ_{s+1}).

Then ∀ s : 0 ≤ s < k, tag(γ_s) ≤ tag(γ_{s+1}).

Proof. If γ_s = γ_{s+1}, then it is trivially true that tag(γ_s) ≤ tag(γ_{s+1}). Therefore assume that
γ_s ≠ γ_{s+1}; this implies that the cfg-upg-prop-fix event of γ_s precedes the cfg-upgrade event of
γ_{s+1}. Let k_2 be the largest element in removal-set(γ_s). We know by assumption that
k_2 + 1 ∈ removal-set(γ_{s+1}). Therefore, W(γ_s), a write-quorum of configuration c(k_2 + 1), has at
least one element in common with R(γ_{s+1}, k_2 + 1); label this node j. By Lemma 4.6 and the
monotonicity of tag_j, after the cfg-upg-prop-fix event of γ_s we know that tag_j ≥ tag(γ_s). Then by
Lemma 4.5, tag(γ_{s+1}) ≥ tag_j. Therefore tag(γ_s) ≤ tag(γ_{s+1}). □

Corollary 5.5 Let γ_0, ..., γ_k be a sequence of configuration upgrade operations such that:

1. ∀ s : 0 ≤ s ≤ k, s ∈ removal-set(γ_s),

2. ∀ s : 0 ≤ s < k, if γ_s ≠ γ_{s+1}, then the cfg-upg-prop-fix event of γ_s occurs in α and
the cfg-upgrade event of γ_{s+1} occurs in α, and the cfg-upg-prop-fix event of γ_s precedes the
cfg-upgrade event of γ_{s+1}, and

3. ∀ s : 0 ≤ s < k, if γ_s ≠ γ_{s+1}, then target(γ_s) ∈ removal-set(γ_{s+1}).

Then ∀ s, s' : 0 ≤ s ≤ s' ≤ k, tag(γ_s) ≤ tag(γ_{s'}).

Proof. This follows immediately from Lemma 5.4 by induction. □

5.2 Behavior of a read or a write following a configuration upgrade 

Now we describe the relationship between an upgrade operation and a following read or write op- 
eration. These three lemmas relate the removal-set of a preceding configuration upgrade operation 
with the query-cmap of a later read or write operation. 

The first lemma shows that if, for some read or write operation, k is the smallest index such
that query-cmap(k) ∈ C, then some configuration upgrade operation with target k precedes the
read or write operation.

Lemma 5.6 Let π be a read or write operation whose query-fix event occurs in α. Let k be the
smallest element such that query-cmap(π)(k) ∈ C. Assume k > 0. Then there must exist a
configuration upgrade operation γ such that k = target(γ), and the cfg-upg-prop-fix event of γ
precedes the query-phase-start(π) event.

Proof. This follows from Lemma 5.1. Let s be the state just before the query-phase-start(π)
event. By definition, query-cmap(π) = s.cmap_i. Since s.cmap(k − 1)_i = ± and s.cmap(k)_i ≠ ±,
there must exist such a configuration upgrade operation for k by the contrapositive of Lemma 5.1.
□

Second, if some upgrade removing k does complete before the query-phase-start event of a read
or write operation, then some configuration with index ≥ k + 1 must be included in the query-cmap
of that read or write operation.

Lemma 5.7 Let γ be a configuration upgrade operation such that k ∈ removal-set(γ). Let π be a
read or write operation whose query-fix event occurs in α. Suppose that the cfg-upg-prop-fix event
of γ precedes the query-phase-start(π) event in α.
Then query-cmap(π)(ℓ) ∈ C for some ℓ ≥ k + 1.

Proof. Suppose for the sake of contradiction that query-cmap(π)(ℓ) ∉ C for all ℓ ≥ k + 1. Fix
k' = max({ℓ' : query-cmap(π)(ℓ') ∈ C}). Then k' ≤ k.

Let γ_0, ..., γ_k be the sequence of upgrade operations whose existence is asserted by Lemma 5.3,
where γ_k = γ. Then, by this construction, k' ∈ removal-set(γ_{k'}), and the cfg-upg-prop-fix event of
γ_{k'} does not come after the cfg-upg-prop-fix event of γ in α. By assumption, the cfg-upg-prop-fix
event of γ precedes the query-phase-start(π) event in α. Therefore the cfg-upg-prop-fix event of γ_{k'}
precedes the query-phase-start(π) event in α.

Then, since k' ∈ removal-set(γ_{k'}), write-quorum W(γ_{k'}, k') is defined. Since query-cmap(π)(k') ∈
C, the read-quorum R(π, k') is defined. Choose j ∈ W(γ_{k'}, k') ∩ R(π, k'). Assume that k_t =
target(γ_{k'}). Notice that k' < k_t. Then Lemma 4.5 and monotonicity of cmap imply that, in the
state just prior to the cfg-upg-query-fix event of γ_{k'}, cmap(ℓ)_j ≠ ⊥ for all ℓ ≤ k_t. Then Lemma 4.7
implies that query-cmap(π)(ℓ) ∈ C for some ℓ ≥ k_t. But this contradicts the choice of k'. □

The next lemma describes propagation of tag information from a configuration upgrade operation
to a following read or write operation. For this lemma, we assume that query-cmap(k) ∈ C,
where k is the target of the upgrade operation.

Lemma 5.8 Let γ be a configuration upgrade operation. Assume that k = target(γ). Let π be a
read or write operation whose query-fix event occurs in α. Suppose that the cfg-upg-prop-fix event of
γ precedes the query-phase-start(π) event in execution α. Suppose also that query-cmap(π)(k) ∈ C.
Then:

1. tag(γ) ≤ tag(π).

2. If π is a write operation then tag(γ) < tag(π).

Proof. The propagation phase of γ accesses write-quorum W(γ) of c(k), whereas the query
phase of π accesses read-quorum R(π, k). Since both are quorums of configuration c(k), they have
a nonempty intersection; choose j ∈ W(γ) ∩ R(π, k).

Lemma 4.6 implies that, in any state after the cfg-upg-prop-fix event for γ, tag_j ≥ tag(γ). Since
the cfg-upg-prop-fix event of γ precedes the query-phase-start(π) event, we have that t ≥ tag(γ),
where t is defined to be the value of tag_j just before the query-phase-start(π) event. Then Lemma 4.7
implies that tag(π) ≥ t, and if π is a write operation, then tag(π) > t. Combining the inequalities
yields both conclusions of the lemma. □
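The argument above relies only on the fact that every write-quorum of a configuration intersects every read-quorum of the same configuration. The following Python sketch illustrates that property under the assumption of majority quorums, which is one common choice and is not prescribed by the text; the helper names are hypothetical.

```python
from itertools import combinations

def majority_quorums(members):
    """All strict-majority subsets of a member set (an assumed quorum system)."""
    size = len(members) // 2 + 1
    return [frozenset(q) for q in combinations(sorted(members), size)]

def quorums_intersect(read_quorums, write_quorums):
    """True if every read-quorum shares a node with every write-quorum."""
    return all(r & w for r in read_quorums for w in write_quorums)

def witness(write_quorum, read_quorum):
    """Pick a node j in the intersection, playing the role of j in the proof."""
    return min(write_quorum & read_quorum)

members = {1, 2, 3, 4, 5}
quorums = majority_quorums(members)
j = witness(frozenset({1, 2, 3}), frozenset({3, 4, 5}))
```

The node `j` is the process through which the tag written by the upgrade's propagation phase reaches the later operation's query phase.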

5.3 Behavior of sequential reads and writes 

Read or write operations that originate at different locations may proceed concurrently. However, 
in the special case where they execute sequentially, we can prove some relationships between their 
query-cmaps, prop-cmaps, and tags. The first lemma says that, when two read or write operations
execute sequentially, the smallest configuration index used in the propagation phase of the first 
operation is less than or equal to the largest index used in the query phase of the second. In other 
words, we cannot have a situation in which the second operation's query phase executes using only 
configurations with indices that are strictly less than any used in the first operation's propagation 
phase.

Lemma 5.9 Assume π_1 and π_2 are two read or write operations, such that:

1. The prop-fix event of π_1 occurs in α.

2. The query-fix event of π_2 occurs in α.

3. The prop-fix event of π_1 precedes the query-phase-start(π_2) event.

Then min({ℓ : prop-cmap(π_1)(ℓ) ∈ C}) ≤ max({ℓ : query-cmap(π_2)(ℓ) ∈ C}).

Proof. Suppose for the sake of contradiction that min({ℓ : prop-cmap(π_1)(ℓ) ∈ C}) > k, where
k is defined to be max({ℓ : query-cmap(π_2)(ℓ) ∈ C}). Then in particular, prop-cmap(π_1)(k) ∉ C.
The form of prop-cmap(π_1), as expressed in Invariant 4.4, implies that prop-cmap(π_1)(k) = ±.

This implies that some cfg-upg-prop-fix event for some upgrade operation γ such that k ∈
removal-set(γ) occurs prior to the prop-fix of π_1, and hence prior to the query-phase-start(π_2) event.
Lemma 5.7 then implies that query-cmap(π_2)(ℓ) ∈ C for some ℓ ≥ k + 1. But this contradicts the
choice of k. □

The next lemma describes propagation of tag information, in the case where the propagation 
phase of the first operation and the query phase of the second operation share a configuration. 

Lemma 5.10 Assume π_1 and π_2 are two read or write operations, and k ∈ N, such that:

1. The prop-fix event of π_1 occurs in α.

2. The query-fix event of π_2 occurs in α.

3. The prop-fix event of π_1 precedes the query-phase-start(π_2) event.

4. prop-cmap(π_1)(k) and query-cmap(π_2)(k) are both in C.

Then:

1. tag(π_1) ≤ tag(π_2).

2. If π_2 is a write then tag(π_1) < tag(π_2).

Proof. The hypotheses imply that prop-cmap(π_1)(k) = query-cmap(π_2)(k) = c(k). Then W(π_1, k)
and R(π_2, k) are both defined in α. Since they are both quorums of configuration c(k), they have
a nonempty intersection; choose j ∈ W(π_1, k) ∩ R(π_2, k).

Lemma 4.8 implies that, in any state after the prop-fix event of π_1, tag_j ≥ tag(π_1). Since the
prop-fix event of π_1 precedes the query-phase-start(π_2) event, we have that t ≥ tag(π_1), where t is
defined to be the value of tag_j just before the query-phase-start(π_2) event. Then Lemma 4.7 implies
that tag(π_2) ≥ t, and if π_2 is a write operation, then tag(π_2) > t. Combining the inequalities yields
both conclusions. □

The final lemma is similar to the previous one, but it does not assume that the propagation 
phase of the first operation and the query phase of the second operation share a configuration. The 
main focus of the proof is on the situation where all the configuration indices used in the query 
phase of the second operation are greater than those used in the propagation phase of the first 
operation.

Lemma 5.11 Assume π_1 and π_2 are two read or write operations, such that:

1. The prop-fix of π_1 occurs in α.

2. The query-fix of π_2 occurs in α.

3. The prop-fix event of π_1 precedes the query-phase-start(π_2) event.

Then:

1. tag(π_1) ≤ tag(π_2).

2. If π_2 is a write then tag(π_1) < tag(π_2).

Proof. Let i_1 and i_2 be the indices of the processes that run operations π_1 and π_2, respectively.
Let cm_1 = prop-cmap(π_1) and cm_2 = query-cmap(π_2). If there exists k such that cm_1(k) ∈ C and
cm_2(k) ∈ C, then Lemma 5.10 implies the conclusions of the lemma. So from now on, we assume
that no such k exists.

Lemma 5.9 implies that min({ℓ : cm_1(ℓ) ∈ C}) ≤ max({ℓ : cm_2(ℓ) ∈ C}). Invariant 4.4 implies
that the set of indices used in each phase consists of consecutive integers. Since the intervals have
no indices in common, it follows that s_1 < s_2, where s_1 is defined to be max({ℓ : cm_1(ℓ) ∈ C}) and
s_2 is defined to be min({ℓ : cm_2(ℓ) ∈ C}).

Lemma 5.6 implies that there exists a configuration upgrade operation that we will call γ_{s_2−1}
such that s_2 = target(γ_{s_2−1}), and the cfg-upg-prop-fix of γ_{s_2−1} precedes the query-phase-start(π_2)
event. Then by Lemma 5.8, tag(γ_{s_2−1}) ≤ tag(π_2), and if π_2 is a write operation then tag(γ_{s_2−1}) <
tag(π_2).

Next we will demonstrate a chain of configuration upgrade operations with non-decreasing tags.
Lemma 5.3, in conjunction with the already defined γ_{s_2−1}, implies the existence of a sequence of
configuration upgrade operations γ_0, ..., γ_{s_2−1} such that:

1. ∀ s : 0 ≤ s ≤ s_2 − 1, s ∈ removal-set(γ_s),

2. ∀ s : 0 ≤ s < s_2 − 1, if γ_s ≠ γ_{s+1}, then the cfg-upg-prop-fix event of γ_s precedes the cfg-upgrade
event of γ_{s+1} in α,

3. ∀ s : 0 ≤ s < s_2 − 1, if γ_s ≠ γ_{s+1}, then target(γ_s) ∈ removal-set(γ_{s+1}).

As a special case of Property 1, since s_1 ≤ s_2 − 1, we know that s_1 ∈ removal-set(γ_{s_1}). Then
Corollary 5.5 implies that tag(γ_{s_1}) ≤ tag(γ_{s_2−1}).

It remains to show that the tag of π_1 is no greater than the tag of γ_{s_1}. Therefore we focus now
on the relationship between operation π_1 and configuration upgrade γ_{s_1}. The propagation phase of
π_1 accesses write-quorum W(π_1, s_1) of configuration c(s_1), whereas the query phase of γ_{s_1} accesses
read-quorum R(γ_{s_1}, s_1) of configuration c(s_1). Since W(π_1, s_1) ∩ R(γ_{s_1}, s_1) ≠ ∅, we may fix some
j ∈ W(π_1, s_1) ∩ R(γ_{s_1}, s_1). Let message m_1 from i_1 to j and message m'_1 from j to i_1 be as in
Lemma 4.8 for the propagation phase of π_1.

Let message m_2 from the process running γ_{s_1} to j and message m'_2 from j to the process running
γ_{s_1} be the messages whose existence is asserted in Lemma 4.5 for the query phase of γ_{s_1}.

We claim that j sends m'_1, its message for π_1, before it sends m'_2, its message for γ_{s_1}. Suppose
for the sake of contradiction that j sends m'_2 before it sends m'_1. Assume that s_t = target(γ_{s_1}).
Notice that s_t > s_1, since s_1 ∈ removal-set(γ_{s_1}). Lemma 4.5 implies that in any state after j
receives m_2, before j sends m'_2, cmap(k)_j ≠ ⊥ for all k ≤ s_t. Since j sends m'_2 before it sends
m'_1, monotonicity of cmap implies that just before j sends m'_1, cmap(k)_j ≠ ⊥ for all k ≤ s_t. Then
Lemma 4.8 implies that prop-cmap(π_1)(ℓ) ∈ C for some ℓ ≥ s_t. But this contradicts the choice of
s_1, since s_1 < s_t. This implies that j sends m'_1 before it sends m'_2.

Since j sends m'_1 before it sends m'_2, Lemma 4.8 implies that, at the time j sends m'_2, tag(π_1) ≤
tag_j. Then Lemma 4.5 implies that tag(π_1) ≤ tag(γ_{s_1}). From above, we know that tag(γ_{s_1}) ≤
tag(γ_{s_2−1}), and tag(γ_{s_2−1}) ≤ tag(π_2), and if π_2 is a write operation then tag(γ_{s_2−1}) < tag(π_2).
Combining the various inequalities then yields both conclusions. □
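For reference, the inequalities combined in the last step of this proof can be written as a single chain. This is only a restatement of the steps above (Lemmas 4.8 and 4.5, Corollary 5.5, and Lemma 5.8, in that order), typeset in LaTeX:

```latex
\[
  \mathit{tag}(\pi_1)
  \;\le\; \mathit{tag}(\gamma_{s_1})
  \;\le\; \mathit{tag}(\gamma_{s_2-1})
  \;\le\; \mathit{tag}(\pi_2),
\]
% When \pi_2 is a write operation, the last inequality is strict:
\[
  \mathit{tag}(\pi_1)
  \;\le\; \mathit{tag}(\gamma_{s_1})
  \;\le\; \mathit{tag}(\gamma_{s_2-1})
  \;<\; \mathit{tag}(\pi_2).
\]
```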

5.4 Atomicity 

In order to prove that all executions of Rambo II are atomic, we use four sufficient conditions. A 
memory is said to be atomic provided that the following conditions hold for all good executions: 

• If all the read and write operations that are invoked complete, then the read and write
operations for object x can be partially ordered by an ordering ≺, so that:

1. No operation has infinitely many other operations ordered before it.

2. The partial order is consistent with the external order of invocations and responses, that
is, there do not exist read or write operations π_1 and π_2 such that π_1 completes before
π_2 starts, yet π_2 ≺ π_1.

3. All write operations are totally ordered and every read operation is ordered with respect
to all the writes.

4. Every read operation ordered after any writes returns the value of the last write preceding
it in the partial order; any read operation ordered before all writes returns the initial
value.

This definition is sufficient to guarantee atomicity in terms of the other common definition which 
is defined in terms of equivalence to a serial memory. (See, for example, Lemma 13.16 in [11].) 

Let β be a trace of S, the system that implements Rambo II (recall that this includes the
Reader-Writer, Recon and Joiner automata), and assume that all read and write operations
complete in β. Consider any particular good execution α of S whose trace is β. We define a partial
order ≺ on read and write operations in β, in terms of the operations' tags in α. Namely, we totally
order the writes in order of their tags, and we order each read with respect to all the writes as
follows: a read with tag t is ordered after all writes with tags ≤ t and before all writes with tags
> t.
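The ordering just defined can be sketched as a small Python predicate. The sketch assumes a tag is a pair (sequence number, process identifier) compared lexicographically, with the identifier acting as the low-order tiebreaker mentioned in the proof of Lemma 5.12; the `Op` class and `precedes` function are hypothetical names.

```python
from dataclasses import dataclass
from typing import Tuple

Tag = Tuple[int, int]  # assumed shape: (sequence number, process id)

@dataclass(frozen=True)
class Op:
    kind: str  # "read" or "write"
    tag: Tag

def precedes(a: Op, b: Op) -> bool:
    """True if a is ordered before b in the tag-based partial order."""
    if a.kind == "write" and b.kind == "write":
        return a.tag < b.tag   # writes are totally ordered by tag
    if a.kind == "write" and b.kind == "read":
        return a.tag <= b.tag  # a read follows all writes with tags <= its own
    if a.kind == "read" and b.kind == "write":
        return a.tag < b.tag   # a read precedes all writes with larger tags
    return False               # two reads are not ordered with respect to each other

w1 = Op("write", (1, 0))
w2 = Op("write", (1, 1))
r = Op("read", (1, 0))
```

Here `r` carries the tag of `w1`, so it is ordered after `w1` and before `w2`, which is the placement that lets it return the value written by `w1`.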

Lemma 5.12 The ordering ≺ is well-defined.

Proof. The key is to show that no two write operations get assigned the same tag. This is obvi- 
ously true for two writes that are initiated at different locations, because the low-order tiebreaker 
identifiers are different. For two writes at the same location, Lemma 5.11 implies that the tag of 
the second is greater than the tag of the first. This suffices. □ 

Lemma 5.13 ≺ satisfies the four conditions in the definition of atomicity.

Proof. We begin with Property 2, which, as usual in such proofs, is the most interesting thing to
show. Suppose for the sake of contradiction that π_1 completes before π_2 starts, yet π_2 ≺ π_1. We
consider two cases:

1. π_2 is a write operation.

Since π_1 completes before π_2 starts, Lemma 5.11 implies that tag(π_2) > tag(π_1). On the
other hand, the fact that π_2 ≺ π_1 implies that tag(π_2) ≤ tag(π_1). This yields a contradiction.

2. π_2 is a read operation.

Since π_1 completes before π_2 starts, Lemma 5.11 implies that tag(π_2) ≥ tag(π_1). On the
other hand, the fact that π_2 ≺ π_1 implies that tag(π_2) < tag(π_1). This yields a contradiction.

Since we have a contradiction in either case, Property 2 must hold.

Property 1 follows from Property 2. Properties 3 and 4 are straightforward. □

Now we tie everything together for the proof of Theorem 5.14. 

Theorem 5.14 Let β be a trace of S, the system that implements Rambo II. Then β satisfies the
atomicity guarantee.

Proof. Assume that all read and write operations complete in β. Let α be a good execution of
S whose trace is β. Define the ordering ≺ on the read and write operations in β as above, using
the execution α. Then Lemma 5.13 says that ≺ satisfies the four conditions in the definition of
atomicity. Thus, β satisfies the atomicity condition, as needed. □

6 Reconfiguration Service 

In this section we present the specification and implementation of the reconfiguration service.
This section is a restatement of Sections 4 and 7 of the Rambo technical report, and is taken
directly from [13]. Our Rambo implementation for each object x consists of a main Reader-Writer
algorithm and a reconfiguration service, Recon(x); since we are suppressing mention of x, we write
this simply as Recon. First, in Section 6.1, we present the specification for the Recon service, as
an external signature and set of traces. In Section 6.2, we present our implementation of Recon.

6.1 Reconfiguration Service Specification 

The interface for Recon appears in Figure 9. The client of Recon at location i requests to join
the reconfiguration service by performing a join(recon)_i input action. The service acknowledges
this with a corresponding join-ack_i output action. The client requests to reconfigure the object
using a recon_i input, which is acknowledged with a recon-ack_i output action. Rambo reports a new
configuration to the client using a report_i output action. Crashes are modeled using fail actions.

Recon also produces outputs of the form new-config(c, k)_i, which announce at location i that c
is the kth configuration identifier for the object. These outputs are used for communication with
the portion of the Reader-Writer algorithm running at location i. Recon announces consistent 
information, only one configuration identifier per index in the configuration identifier sequence. 
It delivers information about each configuration to members of the new configuration and of the 
immediately preceding configuration. 

Now we define the set of traces describing Recon's safety properties. Again, these are defined in
terms of environment assumptions and service guarantees. The environment assumptions are
simple well-formedness conditions, consistent with the well-formedness assumptions for Rambo:

• Well-formedness:

- For every i:

* No join(recon)_i or recon(*, *)_i event is preceded by a fail_i event.

* At most one join(recon)_i event occurs.

* Any recon(*, *)_i event is preceded by a join-ack(recon)_i event.

* Any recon(*, *)_i event is preceded by an -ack for any preceding recon(*, *)_i event.

- For every c, at most one recon(*, c)_* event occurs.

- For every c, c', x, and i, if a recon(c, c')_i event occurs, then it is preceded by:

* A report(c)_i event, and

* A join-ack(recon)_j for every j ∈ members(c').

Input:
join(recon)_i, i ∈ I
recon(c, c')_i, c, c' ∈ C, i ∈ members(c)
fail_i, i ∈ I

Output:
join-ack(recon)_i, i ∈ I
recon-ack(b)_i, b ∈ {ok, nok}, i ∈ I
report(c)_i, c ∈ C, i ∈ I
new-config(c, k)_i, c ∈ C, k ∈ N+, i ∈ I

Figure 9: Recon: External signature

The safety guarantees provided by the service are as follows: 

• Well-formedness: For every i:

- No join-ack(recon)_i, recon-ack(*)_i, report(*)_i, or new-config(*, *)_i event is preceded by a
fail_i event.

- Any join-ack(recon)_i (resp., recon-ack(c)_i) event has a preceding join(recon)_i (resp., recon_i)
event with no intervening invocation or response action for x and i.

• Agreement: If new-config(c, k)_i and new-config(c', k)_j both occur, then c = c'. (No
disagreement arises about what the kth configuration identifier is, for any k.)

• Validity: If new-config(c, k)_i occurs, then it is preceded by a recon(*, c)_{i'} for some i' for which
a matching recon-ack(nok)_{i'} does not occur. (Any configuration identifier that is announced
was previously requested by someone who did not receive a negative acknowledgment.)

• No duplication: If new-config(c, k)_i and new-config(c, k')_{i'} both occur, then k = k'. (The
same configuration identifier cannot be assigned to two different positions in the sequence of
configuration identifiers.)
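The Agreement and No duplication guarantees lend themselves to a simple trace check. The following Python sketch assumes a trace is given as a list of (c, k, i) tuples, one per new-config(c, k)_i event; this encoding and the function name `check_new_config_trace` are illustrative choices, not part of the specification.

```python
def check_new_config_trace(events):
    """Return None if the trace satisfies Agreement and No duplication,
    otherwise a short description of the first violation found."""
    config_at_index = {}  # k -> c announced for index k
    index_of_config = {}  # c -> k at which c was announced
    for c, k, _location in events:
        # Agreement: only one configuration identifier per index k.
        if config_at_index.setdefault(k, c) != c:
            return f"agreement violated at index {k}"
        # No duplication: a configuration identifier occupies one index only.
        if index_of_config.setdefault(c, k) != k:
            return f"configuration {c} announced at two indices"
    return None

good = [("c1", 1, "i"), ("c1", 1, "j"), ("c2", 2, "i")]
disagree = [("c1", 1, "i"), ("c2", 1, "j")]
duplicate = [("c1", 1, "i"), ("c1", 2, "j")]
```

Validity is not checked here, since it also depends on the recon and recon-ack events of the trace.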

6.2 Reconfiguration Service Implementation 

In this section, we describe a distributed algorithm that implements the Recon service for a
particular object x (and we suppress mention of x). This algorithm is considerably simpler than the
Reader-Writer algorithm. It consists of a Recon_i automaton for each location i, which interacts
with a collection of global consensus services Cons(k, c), one for each k ≥ 1 and each c ∈ C, and
with a point-to-point communication service.

Cons(k, c) accepts inputs from members of configuration c, which it assumes to be the (k − 1)st
configuration. These inputs are proposed new configurations. The decision reached by Cons(k, c),
which must be one of the proposed configurations, is determined to be the kth configuration.

Recon_i is activated by the joining protocol. It processes reconfiguration requests using the
consensus services, and records the new configurations that the consensus services determine. Recon_i
also conveys information about new configurations to the members of those configurations, and
releases new configurations for use by Reader-Writer_i. It returns acknowledgments and configuration
reports to its client.

6.3 Consensus services 

In this section, we specify the behavior we assume for consensus service Cons(k, c), for a fixed k ≥ 1
and c ∈ C. This behavior can be achieved using the Paxos consensus algorithm [9], as described
formally in [14]. Fix V to be the set of consensus values. (In the implementation of the Recon
service, V will be instantiated as C.) The external signature of Cons(k, c) is given in Figure 10.

Input:
init(v)_{k,c,i}, v ∈ V, i ∈ members(c)
fail_i, i ∈ members(c)

Output:
decide(v)_{k,c,i}, v ∈ V, i ∈ members(c)

Figure 10: Cons(k, c): External signature

We describe the safety properties of Cons(k, c) in terms of properties of a trace β of actions in
the external signature. Namely, we define the client safety assumptions:

• Well-formedness: For any i ∈ members(c):

- No init(*)_{k,c,i} event is preceded by a fail_i event.

- At most one init(*)_{k,c,i} event occurs in β.

And we define the consensus safety guarantees:

• Well-formedness: For any i ∈ members(c):

- No decide(*)_{k,c,i} event is preceded by a fail_i event.

- At most one decide(*)_{k,c,i} event occurs in β.

- If a decide(*)_{k,c,i} event occurs in β, then it is preceded by an init(*)_{k,c,i} event.

• Agreement: If decide(v)_{k,c,i} and decide(v')_{k,c,i'} events occur in β, then v = v'.

• Validity: If a decide(v)_{k,c,i} event occurs in β, then it is preceded by an init(v)_{k,c,j}.
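As an illustration of how these conditions constrain a trace, the following Python sketch checks a finite trace of a single Cons(k, c) instance. The trace encoding, a list of (kind, value, location) tuples with kind in {"init", "decide", "fail"}, and the function name are assumptions made for this example.

```python
def check_cons_trace(trace):
    """Return None if the trace meets the client assumptions and the
    consensus guarantees, otherwise the name of the violated condition."""
    proposed = set()  # values v appearing in some init(v)
    inited = set()    # locations that have performed init
    decided = {}      # location -> decided value
    failed = set()    # locations that have failed
    for kind, v, i in trace:
        if kind == "fail":
            failed.add(i)
        elif kind == "init":
            if i in failed or i in inited:
                return "client assumption"
            inited.add(i)
            proposed.add(v)
        elif kind == "decide":
            if i in failed or i in decided or i not in inited:
                return "well-formedness"
            if v not in proposed:
                return "validity"
            if any(v != w for w in decided.values()):
                return "agreement"
            decided[i] = v
    return None

ok = [("init", "c1", 1), ("init", "c2", 2), ("decide", "c2", 1), ("decide", "c2", 2)]
split = [("init", "c1", 1), ("init", "c2", 2), ("decide", "c1", 1), ("decide", "c2", 2)]
```

A checker of this kind only examines safety; the latency behavior of the consensus service is a separate matter.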

We assume that the Cons(k,c) service is implemented using the Paxos algorithm [9], as de- 
scribed formally in [14]. This satisfies the safety guarantees described above, based on the safety 
assumptions: 

Theorem 6.1 If β is a trace of Paxos that satisfies the safety assumptions of Cons(k, c), then β
also satisfies the (well-formedness, agreement, and validity) safety guarantees of Cons(k, c).

The Paxos algorithm also satisfies the following latency result: 

Theorem 6.2 Consider a timed execution α of the Paxos algorithm and a prefix α' of α. Suppose
that:

1. The underlying system "behaves well" after α', in the sense that timing is "normal" (what is
called "regular" in [14])³ and no process failures or message losses occur.

2. For every i that does not fail in α, an init(*)_i event occurs in α'.

3. There exist R ∈ read-quorums(c) and W ∈ write-quorums(c) such that for all i ∈ R ∪ W, no
fail_i event occurs in α.

Then for every i that does not fail in α, a decide(*)_i event occurs, no later than 9d + ε time after
the end of α' (ε > 0).

6.4 Recon automata 

A Recon_i process is responsible for initiating consensus executions to help determine new
configurations, for telling the local Reader-Writer_i process about a newly-determined configuration,
and for disseminating information about newly-determined configurations to the members of those
configurations. The signature and state of Recon_i appear in Figure 11, and the transitions in
Figure 12.

Signature:

Input:
  join(recon)_i
  recon(c, c')_i, c, c' ∈ C, i ∈ members(c)
  decide(c)_{k,i}, c ∈ C, k ∈ N+
  recv(⟨config, c, k⟩)_{j,i}, c ∈ C, k ∈ N+,
    i ∈ members(c), j ∈ I − {i}
  recv(⟨init, c, c', k⟩)_{j,i}, c, c' ∈ C, k ∈ N+,
    i, j ∈ members(c), j ≠ i
  fail_i

Output:
  join-ack(recon)_i
  new-config(c, k)_i, c ∈ C, k ∈ N+
  init(c, c')_{k,i}, c, c' ∈ C, k ∈ N+, i ∈ members(c)
  recon-ack(b)_i, b ∈ {ok, nok}
  report(c)_i, c ∈ C
  send(⟨config, c, k⟩)_{i,j}, c ∈ C, k ∈ N+,
    j ∈ members(c) − {i}
  send(⟨init, c, c', k⟩)_{i,j}, c, c' ∈ C, k ∈ N+,
    i, j ∈ members(c), j ≠ i

State:
  status ∈ {idle, active}, initially idle
  rec-cmap ∈ CMap, initially rec-cmap(0) = c_0
    and rec-cmap(k) = ⊥ for all k ≠ 0
  did-init ⊆ N+, initially ∅
  did-new-config ⊆ N+, initially ∅
  cons-data ∈ (N+ → (C × C)), initially ⊥ everywhere
  rec-status ∈ {idle, active}, initially idle
  outcome ∈ {ok, nok, ⊥}, initially ⊥
  reported ⊆ C, initially ∅
  failed, a Boolean, initially false

Figure 11: Recon_i: Signature and state

Location i joins the Recon service when a join(recon) input occurs. Recon_i responds with a
join-ack.

Recon_i includes a state variable rec-cmap, which holds a CMap: rec-cmap(k) = c indicates that
i knows that c is the kth configuration identifier. If Recon_i has learned that c is the kth configuration
identifier, it can convey this to its local Reader-Writer_i process using a new-config(c, k)_i output
action, and it can inform any other Recon_j process, j ∈ members(c), by sending a ⟨config, c, k⟩
message. Recon_i learns about new configurations either by receiving a decide input from a Cons
service, or by receiving a config or init message from another process.



3 In [14], regular timing implies that messages are delivered reliably within time d, that local processing time is 0,
and that information is "gossiped" at intervals of d.

Input join(recon)_i
Effect:
  if ¬failed then
    if status = idle then
      status ← active

Output join-ack(recon)_i
Precondition:
  ¬failed
  status = active
Effect:
  none

Output new-config(c, k)_i
Precondition:
  ¬failed
  status = active
  rec-cmap(k) = c
  k ∉ did-new-config
Effect:
  did-new-config ← did-new-config ∪ {k}

Output send(⟨config, c, k⟩)_{i,j}
Precondition:
  ¬failed
  status = active
  rec-cmap(k) = c
Effect:
  none

Input recv(⟨config, c, k⟩)_{j,i}
Effect:
  if ¬failed then
    if status = active then
      rec-cmap(k) ← c

Output report(c)_i
Precondition:
  ¬failed
  status = active
  c ∉ reported
  S = {ℓ : rec-cmap(ℓ) ∈ C}
  c = rec-cmap(max(S))
Effect:
  reported ← reported ∪ {c}

Input recon(c, c')_i
Effect:
  if ¬failed then
    if status = active then
      rec-status ← active
      let S = {ℓ : rec-cmap(ℓ) ∈ C}
      if S ≠ ∅ and c = rec-cmap(max(S))
          and cons-data(max(S) + 1) = ⊥ then
        cons-data(max(S) + 1) ← ⟨c, c'⟩
      else outcome ← nok

Output init(c')_{k,c,i}
Precondition:
  ¬failed
  status = active
  cons-data(k) = ⟨c, c'⟩
  if k > 1 then k ∈ did-new-config
  k ∉ did-init
Effect:
  did-init ← did-init ∪ {k}

Output send(⟨init, c, c', k⟩)_{i,j}
Precondition:
  ¬failed
  status = active
  cons-data(k) = ⟨c, c'⟩
  k ∈ did-init
Effect:
  none

Input recv(⟨init, c, c', k⟩)_{j,i}
Effect:
  if ¬failed then
    if status = active then
      if rec-cmap(k − 1) = ⊥ then rec-cmap(k − 1) ← c
      if cons-data(k) = ⊥ then cons-data(k) ← ⟨c, c'⟩

Input decide(c')_{k,c,i}
Effect:
  if ¬failed then
    if status = active then
      rec-cmap(k) ← c'
      if rec-status = active then
        if cons-data(k) = ⟨c, c'⟩ then outcome ← ok
        else outcome ← nok

Output recon-ack(b)_i
Precondition:
  ¬failed
  status = active
  rec-status = active
  b = outcome
Effect:
  rec-status ← idle
  outcome ← ⊥

Input fail_i
Effect:
  failed ← true



Figure 12: Recon_i: Transitions.

Recon_i receives a reconfiguration request from its environment via a recon(c, c')_i event. Upon
receiving such a request, Recon_i determines whether (a) i is a member of the known configuration
c with the largest index k − 1 and (b) it has not already prepared data for a consensus for the
next larger index k. If both (a) and (b) hold, Recon_i prepares such data, consisting of the pair
⟨c, c'⟩, where c is the (k − 1)st configuration identifier and c' is the proposed configuration identifier.
Otherwise, Recon_i responds negatively to the new reconfiguration request.

Recon_i initiates participation in a Cons(k, c) algorithm when its consensus data are prepared.
After initiating participation in a consensus algorithm, it sends init messages to inform the other
members of c about its initiation of consensus. The other members use this information to prepare
to participate in the same consensus algorithm (and also to update their rec-cmap if necessary).
Thus, there are two ways in which Recon_i can initiate participation in consensus: as a result of a
local recon event, or by receiving an init message from another Recon_j process.

When Recon_i receives a decide(c')_{k,c,i} directly from Cons(k, c), it records configuration c' in
rec-cmap. It also determines if a response to its local client is necessary (if a local reconfiguration
operation is active), and determines the response based on whether the consensus decision is the
same as the locally-proposed configuration identifier.

Each consensus service Cons(k, c) is responsible for conveying consensus decisions to members(c).
The Recon_i components are responsible for telling members(c') about c' by sending new-config
messages.
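
The handling of recon and decide events described above can be sketched for a single node as follows. This is a deliberately simplified illustration, not the Recon_i automaton itself: it omits joining, failures, and all message passing, and the class name `ReconSketch`, its method names, and the use of `None` for ⊥ are assumptions made for this example.

```python
from typing import Dict, Optional, Tuple


class ReconSketch:
    """Simplified single-node view of recon/decide handling (illustrative only)."""

    def __init__(self, c0: str) -> None:
        self.rec_cmap: Dict[int, str] = {0: c0}          # index -> configuration id
        self.cons_data: Dict[int, Tuple[str, str]] = {}  # index -> (c, c')
        self.rec_status = "idle"
        self.outcome: Optional[str] = None               # None stands in for "bottom"

    def recon(self, c: str, c_new: str) -> None:
        """Handle a recon(c, c') request from the local client."""
        self.rec_status = "active"
        k = max(self.rec_cmap) + 1  # index following the largest known one
        if self.rec_cmap[k - 1] == c and k not in self.cons_data:
            self.cons_data[k] = (c, c_new)  # prepare data for Cons(k, c)
        else:
            self.outcome = "nok"            # respond negatively

    def decide(self, k: int, c_decided: str) -> None:
        """Handle a decide(c') input from the consensus service with index k."""
        self.rec_cmap[k] = c_decided
        if self.rec_status == "active":
            proposal = self.cons_data.get(k)
            won = proposal is not None and proposal[1] == c_decided
            self.outcome = "ok" if won else "nok"
```

For instance, a node that knows only c0 and proposes c1 records the pair (c0, c1) for index 1; the outcome becomes ok if consensus decides c1 and nok if it decides some other configuration.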

Theorem 6.3 The Recon implementation guarantees well-formedness, agreement, and validity. 

7 Conditional Performance Analysis 

In this section we give a conditional latency analysis of the new algorithm, focusing on the im- 
provements realized by the aggressive configuration-upgrade mechanism. We show that the new 
algorithm allows the system to recover rapidly after a period of unreliable network connectivity or 
bursty reconfiguration. In particular, we prove that if configurations do not fail too rapidly, then 
progress is guaranteed. First, in Section 7.1, we present a few general definitions. In Sections 7.2
and 7.3, we define the executions being considered, and the environmental assumptions that these
executions satisfy. Then in Sections 7.5, 7.6, and 7.7, we prove a series of lemmas that describe 
how long it takes configuration-upgrade operations to complete. Finally, in Section 7.8 we state 
the main stabilization theorem, and prove that operations will complete as long as the execution 
assumptions are met. Throughout this section, we compare the results with those proved in Section 
9 of the Rambo technical report [13]. 

7.1 Definitions 

In this section, we present a few basic definitions. These definitions do not depend on timing, but 
are needed only for the conditional performance analysis. For these definitions, assume that α is
an execution.

First we define what it means for a configuration to be installed: configuration c is installed when
either of the following holds: (i) c = c_0, or (ii) for some k > 0, for all non-failed i ∈ members(c(k − 1)),
a decide(c)_{k,i} event occurs in α. That is, configuration c = c(k) is installed when every non-failed
member of configuration c(k − 1) performs a decide(c(k)) event.
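
The definition of "installed" can be written as a small predicate. The sketch below assumes a hypothetical encoding in which `configs[k]` is members(c(k)), `failed` is the set of failed nodes, and `decided[k]` is the set of nodes that have performed a decide(c(k)) event; none of these names come from the algorithm.

```python
from typing import Dict, List, Set


def is_installed(k: int,
                 configs: List[Set[str]],
                 failed: Set[str],
                 decided: Dict[int, Set[str]]) -> bool:
    """True if configuration c(k) is installed under the assumed encoding."""
    if k == 0:
        return True  # case (i): the initial configuration c_0
    live_prev = configs[k - 1] - failed          # non-failed members of c(k-1)
    return live_prev <= decided.get(k, set())    # case (ii): all of them decided
```

Note that a failed member of c(k − 1) does not block installation: only the non-failed members are required to decide.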

Next, we define an event that occurs when a configuration is guaranteed to be ready to
be upgraded (though an upgrade operation may occur earlier than this event). We define the
upgrade-ready(k) event, for k > 0, to be the first event in α after which, ∀ℓ < k, the following hold:
(i) configuration c(ℓ) is installed, and (ii) ∀i ∈ members(c(k − 1)) such that i has not failed at the
time of the event, cmap(ℓ)_i ≠ ⊥.

7.2 Limiting Nondeterminism 

The algorithm, as presented, is highly nondeterministic. Therefore for the purposes of analysis, 
we restrict our attention to a subset of executions in which automata follow certain timing-related 
rules. For the rest of this paper we assume a fixed constant d > 0. We assume that gossip occurs
at fixed intervals of time d, and also that in times of good behavior messages are delivered within
time d.⁴

1. Each node i ∈ I performs a send_{i,j} for all j ∈ world_i every time d, as measured by the local
clock of i.

2. Each node i ∈ I performs a send_{i,j} (an "important" send) whenever any of the following
occurs:

• Just after a recv(join)_{j,i} event occurs, if status_i = active.

• (Responses for messages) Just after a recv(*, *, *, *, pns, *)_{j,i} event occurs, if pns >
pnum2(j)_i and status_i = active.

• Just after a new-config(c, k)_i event occurs, if status_i = active and j ∈ world_i.

• Just after a recv(*, *, *, cm, *, *)_{j,i} event occurs, if op.phase_i ≠ idle and for some k,
cm(k) ≠ ⊥ and cmap(k)_i = ⊥.

• Just after a read_i, write_i, or query-fix_i event occurs, if j ∈ members(c), for some c in the
range of op.cmap_i.

• Just after a cfg-upgrade(k)_i event occurs for configuration-upgrade γ, if j ∈ members(cmap(k')_i)
for any k' ∈ removal-set(γ).

• Just after a cfg-upg-query-fix(k)_i event occurs for configuration-upgrade γ, if j ∈ members(cmap(k')_i),
where k' = target(γ).

3. Locally controlled actions of any automaton in the system that have no effects, other than 
the important sends described just above, are performed only once. 

4. If an action is enabled to occur at node i, and has not yet been performed (and therefore is 
not restricted by the previous rule), then it occurs immediately, with zero time passing. 

7.3 The Behavior of the Environment 

Much of the analysis in the original Rambo algorithm makes guarantees about the latency of 
requests when "normal behavior" holds. In Section 9 of [13], Lynch and Shvartsman begin to 
examine how the system behaves in executions that achieve normal behavior after some point. 
Here we adopt a similar model. We first define what it means for an execution to exhibit "normal 
behavior" from some point onward. 

4 It seems, perhaps, that we should not be using d to represent both these quantities; however for consistency with
the original Rambo presentation, we continue to use this convention.

For the rest of the paper, we use the following notation: given some time t ∈ ℝ≥0, J(t, e, α)
represents the set of all nodes j such that join-ack_j occurs no later than time t − e − 2d in α. (Recall
that d has been fixed, above.) In most cases, we will use the notation J(t), when e and α are clear
from the context.

Figure 13: Definition of J(t). [Timeline: join-ack_j occurs at least e + 2d before time t, so j ∈ J(t).]
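
The set J(t, e, α) is simple to compute once the join-ack times of an execution are known. In the sketch below, the dictionary `join_ack_time` (node name to the time of its join-ack) and the function name `joined_set` are assumptions made for illustration.

```python
from typing import Dict, Set


def joined_set(t: float, e: float, d: float,
               join_ack_time: Dict[str, float]) -> Set[str]:
    """Nodes whose join-ack occurs no later than time t - e - 2d."""
    cutoff = t - e - 2 * d
    return {j for j, ack in join_ack_time.items() if ack <= cutoff}
```

With e = 1 and d = 2, a node that joins at time 5 belongs to J(10) but a node that joins at time 9 does not; since the cutoff only grows with t, the computed set never shrinks as t increases.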

7.3.1 Normal Timing Behavior from Some Point Onward 

Let α be an admissible timed execution, and α' a finite prefix of α. Arbitrary behavior is allowed
in α': messages may be lost or delivered late, clocks may run at arbitrary rates, and in general any
asynchronous behavior may occur. However we assume that after α', good behavior resumes. We
say that α is an α'-normal execution if the following assumptions hold:

1. Initial time: The join-ack_{i_0} event occurs at time 0, completing the join protocol for node i_0,
the node that created the data object.⁵

2. Regular timing: The local clocks of all Rambo II automata (i.e., Reader-Writer_i, Recon_i, Joiner_i)
at all nodes progress at exactly the rate of real time, after α'.

3. Reliable message delivery: No message sent in α after α' is lost.

4. Message delay bound: If a message is sent at time t in α and it is delivered, then it is delivered
by time max(t, ℓtime(α')) + d.
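
The message delay bound treats messages sent during the unstable prefix as if they were sent when the prefix ends. A one-line sketch makes this explicit; the names `delivery_deadline` and `prefix_end` (standing for ℓtime(α')) are assumptions for this example.

```python
def delivery_deadline(sent_at: float, prefix_end: float, d: float) -> float:
    """Latest delivery time of a message sent at `sent_at`, if it is delivered at all."""
    return max(sent_at, prefix_end) + d
```

For example, with the prefix ending at time 10 and d = 2, a message sent at time 3 that is not lost arrives by time 12, while a message sent at time 15 arrives by time 17.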

7.3.2 Configuration-Viability 

Next we will define configuration-viability, which is the key assumption needed to guarantee that 
read and write operations complete. As in all quorum-based algorithms, liveness depends on all 
the nodes in some quorums remaining alive. In RAMBO II, a node can make progress only if it is 
able to communicate with the read and write quorums of all extant configurations. We say that a 
configuration has failed when either: (i) some node in every read-quorum of the configuration has 
failed, or (ii) some node in every write-quorum of the configuration has failed. If a configuration 
fails before a new configuration is installed and the old configuration removed, then the system will 
be effectively crashed: no future read or write request will ever complete. In order to guarantee 
that operations complete, then, it is necessary for the client using the RAMBO II system to initiate 
appropriate reconfigurations to ensure that quorums remain accessible. The configuration viability 
assumption is a complex property, depending on the behavior of the algorithm, the client initiating 
appropriate reconfigurations, and on the patterns of node failure and message loss. 
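
The failure condition for a configuration stated above can be sketched as a predicate over its quorum systems. The function name and the representation of quorums as Python sets are assumptions made for illustration.

```python
from typing import Iterable, Set


def configuration_failed(read_quorums: Iterable[Set[str]],
                         write_quorums: Iterable[Set[str]],
                         failed: Set[str]) -> bool:
    """True if every read-quorum, or every write-quorum, contains a failed node."""

    def every_quorum_hit(quorums: Iterable[Set[str]]) -> bool:
        return all(q & failed for q in quorums)

    return every_quorum_hit(read_quorums) or every_quorum_hit(write_quorums)
```

A single failure can be enough: if one node belongs to every read-quorum, its failure alone causes the configuration to fail, even though every write-quorum may still contain only live nodes.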

We define what it means for an execution to be (α', e, τ)-configuration-viable: Let α be an
admissible timed execution, and let α' be a finite prefix of α. Let e, τ ∈ ℝ≥0. Then α is
(α', e, τ)-configuration-viable if the following holds:

For all i, c, k such that cmap(k)_i = c in some state in α, there exist R ∈ read-quorums(c) and
W ∈ write-quorums(c) such that at least one of the following holds:



5 This assumption was implicit in the initial Rambo papers, and was missing from the list of assumptions.

1. No process in R ∪ W fails in α.

2. There exists a finite prefix α_install of α such that for all ℓ < k + 1, configuration c(ℓ) is installed
in α_install and no process in R ∪ W fails in α by time max(ℓtime(α') + e, ℓtime(α_install)) + τ.

By assuming that an execution is (α', e, τ)-configuration-viable, we ensure that the algorithm
has at least time τ after a new configuration is installed to clean up obsolete configurations. Also,
since all configurations are viable until at least time e + τ after α', the algorithm has at least time
e + τ after the system stabilizes to clean up obsolete configurations.

7.3.3 Recon-Spacing 

While reconfigurations cannot impede a read/write operation, too frequent reconfigurations can
slow down a read/write operation by introducing new quorums that must be contacted. In order
to bound the time required for a read/write operation, we need to bound the frequency of
reconfigurations.

There are two components to Recon-Spacing. Let α be an α'-normal execution, and e ∈ ℝ≥0.
Then α satisfies:

1. (α', e)-recon-spacing-1: if for any recon(c, *)_i event in α after α' the preceding report(c)_i event
occurs at least time e earlier.

2. (α', e)-recon-spacing-2: if for any recon(c, *)_i event in α after α' there exists a write-quorum
W ∈ write-quorums(c) such that for all j ∈ W, report(c)_j precedes the recon(c, *)_i event in α.

We say that α satisfies (α', e)-recon-spacing if it satisfies both (α', e)-recon-spacing-1 and
(α', e)-recon-spacing-2.
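
Both parts of recon-spacing can be checked on a finite, time-ordered list of events. The sketch below assumes a hypothetical event encoding (time, kind, configuration, node) and treats a recon with no preceding report at the same node as a violation of part 1; both choices are assumptions of this example rather than part of the definition.

```python
from typing import Dict, List, Set, Tuple

Event = Tuple[float, str, str, str]  # (time, "report" | "recon", c, node)


def satisfies_recon_spacing(trace: List[Event],
                            write_quorums: Dict[str, List[Set[str]]],
                            e: float,
                            prefix_end: float) -> bool:
    """Check recon-spacing-1 and recon-spacing-2 for recon events after prefix_end."""
    report_time: Dict[Tuple[str, str], float] = {}  # (c, node) -> time of report(c)_node
    for time, kind, c, node in trace:  # the trace is assumed to be in execution order
        if kind == "report":
            report_time[(c, node)] = time
        elif kind == "recon" and time > prefix_end:
            # Part 1: the local report(c) happened at least e earlier.
            own = report_time.get((c, node))
            if own is None or time - own < e:
                return False
            # Part 2: some write-quorum of c has reported c at every member.
            reporters = {j for (cc, j) in report_time if cc == c}
            if not any(w <= reporters for w in write_quorums.get(c, [])):
                return False
    return True
```

Recon events that occur within the prefix are ignored, matching the restriction of both parts to events after α'.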

Notice that, instead of assuming the second part of this requirement, we could modify
the Recon automaton to enforce this ordering: the automaton could collect gossip messages
indicating which nodes had performed a report(c), and delay or abort the next recon if it preceded an
appropriate set of report events. We choose to state this as an assumption, rather than as a
modification to the automaton, for two reasons. First, we prefer to retain compatibility with the
original Rambo analysis. Second, by stating this as an assumption, it is possible that the client
using the Rambo II algorithm might choose to violate the second part of the assumption. As a
result, those guarantees that depend on this assumption will not hold; however, reconfigurations
may be performed more frequently. Even if the second part of this assumption is violated,
safety is still guaranteed: atomicity is maintained, and read and write operations are guaranteed
to terminate. However, operations might not terminate within 8d, as in Section 7.8.

7.3.4 Join-Connectivity 

The hypothesis of join-connectivity is designed to ensure that all non-failing joining processes are
able to learn about each other. Otherwise, it is possible for the processes to join and fail in such
a way that the world-views of the nodes are partitioned into multiple components, with different
nodes aware of different, disconnected pieces of the world. It is also important for the latency
analysis to bound how long this process takes. If two nodes both complete the join protocol and
do not fail, then they should learn about each other within a bounded time. For this reason, we
define the notion of join-connectivity as follows:

Let α be an α'-normal execution, e ∈ ℝ≥0. We say that α satisfies (α', e)-join-connectivity
provided that: for any time t and nodes i, j ∈ J(t, e, α), if neither i nor j fails until after max(t −
2d, ℓtime(α') + e), then by time max(t − 2d, ℓtime(α') + e), i ∈ world_j.

This indicates, then, that if two nodes both complete joining by some time t after α', then
within time e the two nodes are aware of each other. If two nodes both complete joining by some
time t during α', then within time e after α' the two nodes are aware of each other.
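
As a sanity check on these two informal readings, the deadline in the definition can be unfolded. Write s for the latest time at which either of the two nodes performs its join-ack; the symbol s and the name T for the deadline are introduced only for this sketch. Membership in J(t, e, α) means s ≤ t − e − 2d, so for the smallest admissible t, namely t = s + e + 2d, we obtain:

```latex
\[
T \;=\; \max\bigl(t - 2d,\ \ell\mathit{time}(\alpha') + e\bigr)
  \;=\; \max\bigl(s + e,\ \ell\mathit{time}(\alpha') + e\bigr)
  \;=\; \max\bigl(s,\ \ell\mathit{time}(\alpha')\bigr) + e .
\]
```

Hence, if both joins complete after α' (s ≥ ℓtime(α')), the deadline is s + e; if both complete during α', the deadline is ℓtime(α') + e.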

Prior results on joining from [13] suggest that in some cases it can be shown that the current
simple join protocol in the Rambo II algorithm provides (α', d + d⌈log(|J|)⌉)-join-connectivity.
However, we will neither prove nor depend on this earlier result. Instead we will assume that the
system provides (α', e)-join-connectivity for some e, and prove our results in this context. We leave
it as an open problem to determine the exact value of e; a more complicated and interactive join
protocol might well provide better results.

7.3.5 Recon-Readiness

The next assumption we make is related to the problem of reconfiguration by a node that has 
recently joined. We will assume that every node that is proposed to be a member of a configuration 
has been a member of the Rambo II system for a reasonable period of time. This allows us to 
conclude that everyone is aware of nodes that are part of active configurations. 

An α'-normal execution α satisfies (α', e)-recon-readiness if the following property holds: if for
some node i and some configurations c and c', a recon(c, c')_i event occurs in α at time t, then:

• If j ∈ members(c'), then j performs a join-ack prior to the recon event.

• If the recon event occurs after α', and if j ∈ members(c'), then j ∈ J(t, e, α).

This prohibits nodes that have just joined the system, but are not yet in anyone's world view,
from forming new configurations. As long as e is not too large, this seems a reasonable requirement.

7.3.6 Upgrade-Readiness 

The last assumption we make ensures that a node initiates an upgrade operation only if it has 
joined sufficiently long ago. This ensures that when a node performs an upgrade, it has relatively 
up-to-date information. 

We say that an α'-normal execution α satisfies (α', e)-upgrade-readiness if the following
property holds: if for some i a cfg-upgrade(*)_i event occurs in α after α' at time t, then i ∈ J(t).

In particular, we suggest that in an implementation of this algorithm, only members of
configuration c(k) initiate operations to upgrade configuration c(k). In this case, recon-readiness
guarantees upgrade-readiness.

7.3.7 Fixed Parameters 

We have already fixed d such that gossip occurs at fixed intervals of time d, and in times of good
behavior messages are delivered within time d. We now fix e as well. Additionally, for the rest of
the paper, we fix α and α', and assume that α is an α'-normal execution. We therefore sometimes
suppress these parameters, as they are clear from context. For example, we will use the notation
J(t) to represent J(t, e, α). When we refer to join-connectivity, we mean (α', e)-join-connectivity;
recon-readiness is used to mean (α', e)-recon-readiness; upgrade-readiness is used to mean
(α', e)-upgrade-readiness; τ-recon-spacing is used to mean (α', τ)-recon-spacing; τ-configuration-viability
is used to mean (α', e, τ)-configuration-viability.

Figure 14: Lemma 7.2, Case 1

Figure 15: Lemma 7.2, Case 2

7.4 Basic Lemmas 

In this section, we prove a few basic lemmas that will be useful in the rest of the paper.
The following two lemmas demonstrate some basic facts about the sets J(·):

Lemma 7.1 1. If t < t', then J(t) ⊆ J(t').

2. For all t, t', J(t) ⊆ J(max(t, t')).

Proof. By definition of J(·). □

The following lemma uses the recon-readiness assumption to say something stronger about the 
joining time of members of a configuration: 

Lemma 7.2 Assume that α is an α'-normal execution satisfying (α', e)-recon-readiness. If h is a
configuration proposed at time t' by a recon(*, h) event, t > t', and t > ℓtime(α') + e + 2d, then
members(h) ⊆ J(t).

Proof. First, assume that t' > ℓtime(α'). Then the result follows immediately by recon-readiness
and Lemma 7.1. Assume, then, that t' < ℓtime(α'). By recon-readiness, every member of
configuration h performs a join-ack by ℓtime(α'). Therefore, by definition of J, members(h) ⊆ J(ℓtime(α') +
e + 2d). Since t > ℓtime(α') + e + 2d, Lemma 7.1 implies that J(ℓtime(α') + e + 2d) ⊆ J(t). □

The next lemma shows a similar result about upgrade-readiness:

Lemma 7.3 Assume that α is an α'-normal execution satisfying (α', e)-upgrade-readiness. If a
cfg-upgrade(*)_i event occurs in α at time t, for some node i, then i ∈ J(max(t, ℓtime(α') + e + 2d)).

Proof. First, assume that the cfg-upgrade event occurs after α'. Then the lemma follows
immediately by the definition of upgrade-readiness and Lemma 7.1. Assume, then, that the cfg-upgrade
event occurs in α'. By the precondition of cfg-upgrade, i must perform a join-ack prior to the
cfg-upgrade event; otherwise status_i ≠ active when the cfg-upgrade occurs, which contradicts the
precondition of the cfg-upgrade. Therefore i performs a join-ack_i at latest at time ℓtime(α'), and
therefore i ∈ J(ℓtime(α') + e + 2d), and the lemma again follows by Lemma 7.1. □

7.5 Propagation of Information

In this section, we introduce the notion of information being in the "mainstream". Once a sufficient
set of nodes know a particular fact, then, under appropriate assumptions, this fact will never be 
forgotten by the system as a whole. In particular, we show that this is true about information in 
the cmap: updates to the cmap are propagated. Once every non-failed node in J(t) updates its 
cmap, then at any time in the future, at time t' > t + 2d, every non-failed node in J(t') will be 
aware of this update. 

If cm is a CMap and β is a finite prefix of α with ℓtime(β) = t > e + 2d, then we say that cm
is mainstream after β provided that the following holds: For every i ∈ J(t) such that fail_i does not
occur in β, cm ≤ ℓstate(β).cmap_i.

Further, we define the following notation: given an execution α and a time t ∈ ℝ≥0, we define
β(t, α) to be the finite prefix of α such that ℓtime(β(t, α)) = t and every event that occurs at time t
occurs in β(t, α). As we have already fixed α, for the rest of this paper we use the simpler notation
of β(t). We then say that a CMap cm is mainstream after t if it is mainstream after β(t).

The first lemma shows a basic property of mainstream cmaps: 

Lemma 7.4 Assume that α is an execution, t is a time, and cm, cm2 are CMaps. If cm ≤ cm2,
and cm2 is mainstream after t, then cm is mainstream after t.

Proof. Immediate from the definition of mainstream. □

The following lemma shows that a node's cmap is monotone:

Lemma 7.5 Assume that α'' is a finite prefix of execution α, and that α''' is a prefix of α''. Assume
that i is a node. Then ℓstate(α''').cmap_i ≤ ℓstate(α'').cmap_i.

Proof. In the algorithm, cmap_i is only modified by the update function, and the update function
is monotone; that is, for all CMaps new-cmap, cmap ≤ update(cmap, new-cmap). □
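
The ordering on CMaps, the monotonicity of update, and the mainstream condition can be illustrated with a small model. The precise definitions of CMap, ≤, and update are given earlier in the paper; the encoding below (each entry is ⊥, a configuration identifier, or a "removed" marker, ordered by how much information it carries, with update taken pointwise) is an assumption of this sketch, as are all function names.

```python
from typing import Dict, Set

BOTTOM, REMOVED = "bottom", "removed"  # assumed stand-ins for the two special entries
CMap = Dict[int, str]


def rank(x: str) -> int:
    """Information content of an entry: bottom < configuration id < removed."""
    return 0 if x == BOTTOM else 2 if x == REMOVED else 1


def update(cmap: CMap, new_cmap: CMap) -> CMap:
    """Pointwise merge keeping, per index, the entry that carries more information."""
    out = dict(cmap)
    for k, v in new_cmap.items():
        if rank(v) > rank(out.get(k, BOTTOM)):
            out[k] = v
    return out


def leq(cm1: CMap, cm2: CMap) -> bool:
    """cm1 <= cm2: every entry of cm1 is matched or superseded in cm2."""
    for k, v in cm1.items():
        w = cm2.get(k, BOTTOM)
        if rank(v) > rank(w) or (rank(v) == rank(w) == 1 and v != w):
            return False
    return True


def is_mainstream(cm: CMap, node_cmaps: Dict[str, CMap],
                  joined: Set[str], failed: Set[str]) -> bool:
    """cm <= cmap_i for every non-failed node i in the joined set J(t)."""
    return all(leq(cm, node_cmaps[i]) for i in joined - failed)
```

In this model, `leq(cmap, update(cmap, new_cmap))` holds for any pair of maps, which is the monotonicity property used in the proof of Lemma 7.5.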

Lemma 7.6 Assume that α is an execution, and t and t' are times, and that t < t'. Assume that
i is a node, and cm is a CMap.

1. If cm ≤ ℓstate(β(t)).cmap_i, then cm ≤ ℓstate(β(t')).cmap_i.

2. ℓstate(β(t)).cmap_i ≤ ℓstate(β(t')).cmap_i.

Proof. This follows by Lemma 7.5, where α''' = β(t) and α'' = β(t'). □

Next, we demonstrate a particular case when a cmap becomes mainstream. 

Lemma 7.7 Let α be an α'-normal execution satisfying (α', e)-join-connectivity. Let t be a time
such that t > ℓtime(α') + e. If i ∈ J(t + 2d), and i does not fail in β(t + d), then ℓstate(β(t)).cmap_i
is mainstream after t + 2d.

Proof. Let cm = ℓstate(β(t)).cmap_i. To show that cm is mainstream after t + 2d, we need to
show that for all j ∈ J(t + 2d) such that j does not fail in β(t + 2d), cm ≤ ℓstate(β(t + 2d)).cmap_j.
Fix any such j. By join-connectivity, j ∈ world_i by time max(t, ℓtime(α') + e) ≤ t.

By time t + d, i sends a gossip message, msg, to node j such that cm ≤ msg.cmap. By time
t + 2d, j receives the gossip message and updates cmap_j with msg.cmap. By the monotonicity of the
update function, msg.cmap ≤ update(cmap_j, msg.cmap). Therefore cm ≤ ℓstate(β(t + 2d)).cmap_j,
as required. □

Figure 16: Lemma 7.7

Figure 17: Lemma 7.9

The following lemma shows that if two nodes are both in the set J(t + 2d), then information is 
propagated from one to the other. 

Lemma 7.8 Let α be an α'-normal execution satisfying (α', e)-join-connectivity. Assume that t
and t' are times, and t' − 2d > t > ℓtime(α') + e. Assume that i and j are nodes, and i, j ∈ J(t + 2d).
Also, assume that i does not fail in β(t + 2d), and j does not fail in β(t').
If cm ≤ ℓstate(β(t)).cmap_i, then cm ≤ ℓstate(β(t')).cmap_j.

Proof. By Lemma 7.7, ℓstate(β(t)).cmap_i is mainstream after t + 2d. Notice that j ∈ J(t + 2d),
and therefore, by the definition of mainstream, ℓstate(β(t)).cmap_i ≤ ℓstate(β(t + 2d)).cmap_j. Since
t + 2d < t', by Lemma 7.6, ℓstate(β(t + 2d)).cmap_j ≤ ℓstate(β(t')).cmap_j. Putting the inequalities
together, cm ≤ ℓstate(β(t')).cmap_j. □

We now show that once a cmap is in the mainstream, after 2d it will always be in the mainstream.
First, Lemma 7.9 considers a special case: it considers only times t' after the system has stabilized,
when a recon(h, h') event occurs. Second, Lemma 7.10 handles the case where the cmap is in the
mainstream at a time in α'. Third, Lemma 7.11 proves the existence of a configuration with some
necessary special properties to prove the main theorem. Finally, Lemmas 7.12 and 7.13 prove the
general result, as summarized in Lemma 7.14.

First, we define a successful recon event as follows: a recon(*, c) event is successful if at some
time afterwards a decide(c)_{k,i} event occurs for some k and i.

Lemma 7.9 Let α be an α'-normal execution satisfying: (i) (α', e)-join-connectivity, (ii)
(α', e)-recon-readiness, (iii) (α', 2d)-recon-spacing-1, and (iv) (α', e, 2d)-configuration-viability.

Assume that t and t' are times, and that t > ℓtime(α') + e + 2d and t' > t. Let h and h' be two
configurations, and assume that recon(h, h')_* occurs at time t', and that this is a successful recon
event.

If cm is mainstream after t, then cm is mainstream after t' + 2d.

Figure 18: Lemma 7.10

Proof. Fix t and cm such that cm is mainstream after t. We prove the result by induction on
the number of successful recon events that occur at or after time t.

As the base case, consider the first successful recon(h, h') event that occurs in α at a time t' > t.
We need to show that cm is mainstream after t' + 2d. Therefore fix some j' ∈ J(t' + 2d) such that
fail_{j'} does not occur in β(t' + 2d). We will show that cm ≤ ℓstate(β(t' + 2d)).cmap_{j'}.

Choose some node j ∈ members(h) such that j does not fail in β(t' + 2d); that is, j does not fail
until after t' + 2d. Configuration-viability ensures that such a node exists. Notice that j ∈ J(t), by
Lemma 7.2. Since cm is mainstream after t, then cm ≤ ℓstate(β(t)).cmap_j.

Note that configuration h is proposed prior to time t, since the recon(h, h') event is the first
successful recon event at or after time t. Therefore configuration h is also proposed prior to time
t'. By Lemma 7.1, j ∈ J(t' + 2d). By assumption j' ∈ J(t' + 2d) and does not fail in β(t' + 2d).
Therefore, by Lemma 7.8, cm ≤ ℓstate(β(t' + 2d)).cmap_{j'}, as needed.

Next we show the inductive step. Inductively assume the following: if recon(*, *) is one of the
first n successful recon events in α that occur at time t' > t, then cm is mainstream after t' + 2d.

Consider the (n + 1)st successful recon(h, h') event in α that occurs at or after t. Assume
this event occurs at time t'. We need to show that cm is mainstream after t' + 2d. Therefore
fix some j' ∈ J(t' + 2d) such that fail_{j'} does not occur in β(t' + 2d). We will show that cm ≤
ℓstate(β(t' + 2d)).cmap_{j'}.

Let p be the nth successful recon(*, h) event, and assume that p occurs at time t_1. Note that
the first argument of the (n + 1)st successful recon event must be the configuration proposed by the
nth successful recon event.

2d-recon-spacing-1 guarantees that t' > t_1 + 2d. The inductive hypothesis shows that cm is
mainstream after t_1 + 2d.

Choose some node j ∈ members(h) such that no fail_j occurs in β(t' + 2d). Configuration-viability
ensures that such a node exists. By recon-readiness and Lemma 7.1, j ∈ J(t' + 2d). By assumption
j' ∈ J(t' + 2d) and j' does not fail in β(t' + 2d). By Lemma 7.8, cm ≤ ℓstate(β(t' + 2d)).cmap_{j'},
as needed. □

The next lemma considers the case where a cmap is mainstream in α' or soon after, and shows
that it is mainstream after ℓtime(α') + e + 4d.

Lemma 7.10 Let α be an α'-normal execution satisfying (i) (α', e)-join-connectivity, (ii)
(α', e)-recon-readiness, (iii) (α', 2d)-recon-spacing-1, and (iv) (α', e, 4d)-configuration-viability.

Assume that t is a time and that e + 2d < t < ℓtime(α') + e + 2d. If cm is mainstream after t,
then cm is mainstream after ℓtime(α') + e + 4d.

Proof. Consider configuration c_0. By configuration-viability, there exists a read-quorum, R ∈
read-quorums(c_0), and a write-quorum, W ∈ write-quorums(c_0), such that no node in R ∪ W fails
by ℓtime(α') + e + 4d.

Let t_1 = ℓtime(α') + e + 2d. Consider i_0 ∈ R ∪ W; i_0 does not fail by ℓtime(α') + e + 4d. Since
i_0 performs a join-ack at time 0, by the assumption that α is an α'-normal execution, and since
t > e + 2d, i_0 ∈ J(t). Also note that by Lemma 7.6, i_0 ∈ J(t_1).

Since cm is mainstream after t, cm ≤ ℓstate(β(t)).cmap_{i_0}. Therefore, we know by Lemma 7.6
that cm ≤ ℓstate(β(t_1)).cmap_{i_0}. By Lemma 7.7, we know that ℓstate(β(t_1)).cmap_{i_0} is mainstream
after t_1 + 2d. Therefore by Lemma 7.4, cm is mainstream after t_1 + 2d; that is, cm is mainstream
after ℓtime(α') + e + 4d. □

The next lemma shows the existence of a certain configuration, h, with some particular
properties. This will be useful in proving Lemma 7.14.

Lemma 7.11 Let α be an α'-normal execution satisfying: (i) (α', e)-join-connectivity, (ii)
(α', e)-recon-readiness, (iii) (α', 2d)-recon-spacing-1, and (iv) (α', e, 4d)-configuration-viability.

Assume that t and t' are times. Assume that ℓtime(α') + e + 2d < t < t' − 2d and ℓtime(α') +
e + 6d < t'. Assume that cm is mainstream after t. Then there exists a configuration h, with index
k, with the following properties:

1. members(h) ⊆ J(t').

2. For all members i of configuration h that do not fail in β(t'), cm ≤ ℓstate(β(t' − 2d)).cmap_i.

3. No successful recon(h, *) event occurs in β(t' − 4d).

Proof. There are three different sub-cases to consider.

1. No successful recon event occurs in $\beta(t' - 4d)$:

Let $h = c_0$. Notice that $\mathit{members}(h) \subseteq J(t)$, since $i_0$ (the only member of $c_0$) completes a
join-ack at time 0 (by assumption on $\alpha$), and $t > \ell\mathit{time}(\alpha') + e + 2d$. This, then, implies
Property 1 by Lemma 7.1. Since $i_0 \in J(t)$ and $\mathit{cm}$ is mainstream after $t$, $\mathit{cm} \leq \ell\mathit{state}(\beta(t)).\mathit{cmap}_{i_0}$.
Therefore, since $t < t' - 2d$, by Lemma 7.6, $\mathit{cm} \leq \ell\mathit{state}(\beta(t' - 2d)).\mathit{cmap}_{i_0}$, as required for
Property 2. Property 3 holds trivially.

2. A successful recon event occurs in $\beta(t' - 4d)$ after time $t$:

Consider the last successful recon event in $\alpha$ that occurs in $\beta(t' - 4d)$; let $h$ be the configuration
identifier appearing as the second argument in this recon event. Assume that this recon event
occurs at time $t_{rec}$. Note that $t < t_{rec} < t' - 4d$. Therefore (since $t' > \ell\mathit{time}(\alpha') + e + 6d$
and $t' > t_{rec}$) by Lemma 7.2, $\mathit{members}(h) \subseteq J(t')$, as required for Property 1. Since $t_{rec} > t$,
Lemma 7.9 shows that $\mathit{cm}$ is mainstream after $t_{rec} + 2d$. Recall that $t_{rec} + 2d < t' - 2d$. By the
mainstream property, for every member, $i$, of configuration $h$ that does not fail in $\beta(t' - 2d)$,
$\mathit{cm} \leq \ell\mathit{state}(\beta(t_{rec} + 2d)).\mathit{cmap}_i$; therefore, for each of these members, $i$, by Lemma 7.6,
$\mathit{cm} \leq \ell\mathit{state}(\beta(t' - 2d)).\mathit{cmap}_i$, as required for Property 2. Property 3 holds by the selection
of the last successful recon event in $\beta(t' - 4d)$.

3. Neither Case 1 nor Case 2 holds, that is, a successful recon event occurs in $\beta(t' - 4d)$, but no
such recon event occurs after time $t$:

Consider the last successful recon event in $\alpha$ that occurs in $\beta(t' - 4d)$; let $h$ be the configuration
identifier appearing as the second argument in this recon event. Assume that this recon
event occurs at time $t_{rec}$. Notice, then, that $t_{rec} < t$. (Otherwise, Case 2 would hold.)
Since $t > \ell\mathit{time}(\alpha') + e + 2d$, then by Lemma 7.2, $\mathit{members}(h) \subseteq J(t)$. By Lemma 7.6,
then, $\mathit{members}(h) \subseteq J(t')$, which implies Property 1. Since $\mathit{cm}$ is mainstream after $t$ (and
$\mathit{members}(h) \subseteq J(t)$), for all $j \in \mathit{members}(h)$ such that no fail$_j$ event occurs in $\beta(t)$,
$\mathit{cm} \leq \ell\mathit{state}(\beta(t)).\mathit{cmap}_j$. Since $t < t' - 2d$, by Lemma 7.6, for all $j$ such that no fail$_j$ event occurs
by time $t' - 2d$, $\mathit{cm} \leq \ell\mathit{state}(\beta(t' - 2d)).\mathit{cmap}_j$, as required for Property 2. Property 3 holds
by the selection of the last successful recon event that occurs in $\beta(t' - 4d)$.

□

Figure 19: Lemma 7.12
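The three sub-cases in the proof of Lemma 7.11 amount to a small selection rule for the configuration $h$. The following is a minimal sketch of that rule, not part of the algorithm itself: the event encoding (`ReconEvent` as a pair of time and new configuration identifier), the function name `choose_h`, and the default identifier `"c0"` are hypothetical, and "occurs in $\beta(t' - 4d)$" is read as "occurs at a time no later than $t' - 4d$".

```python
from typing import List, Optional, Tuple

# Hypothetical encoding: a successful recon event is (time, new configuration id).
ReconEvent = Tuple[float, str]


def choose_h(recons: List[ReconEvent], t: float, t_prime: float, d: float,
             c0: str = "c0") -> Tuple[int, str, Optional[float]]:
    """Return (sub-case number, chosen configuration h, t_rec or None)."""
    cutoff = t_prime - 4 * d
    # Successful recon events that occur in beta(t' - 4d).
    in_prefix = [ev for ev in recons if ev[0] <= cutoff]
    if not in_prefix:
        # Sub-case 1: no successful recon event, so h is the initial configuration.
        return 1, c0, None
    # Sub-cases 2 and 3 both pick the last successful recon event in the prefix.
    t_rec, h = max(in_prefix, key=lambda ev: ev[0])
    # Sub-case 2 if that event falls after time t, otherwise sub-case 3.
    case = 2 if t_rec > t else 3
    return case, h, t_rec
```

In sub-case 2 the proof then relies on Lemma 7.9 (the recon event happens after `t`), while in sub-case 3 it relies on Lemma 7.2 applied at time `t`; the selection of `h` itself is identical in both.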

Finally we prove the main lemma of this section, showing that if a cmap is mainstream at
time $t$, then the cmap is also mainstream at times $t' > t + 2d$. There are two cases to consider: (i)
$t > \ell\mathit{time}(\alpha') + e + 2d$, and (ii) $t < \ell\mathit{time}(\alpha') + e + 2d$. Lemma 7.12 shows the first case, Lemma 7.13
shows the second case, and Lemma 7.14 presents the overall conclusion.

Lemma 7.12 Let $\alpha$ be an $\alpha'$-normal execution satisfying (i) $(\alpha', e)$-join-connectivity,
(ii) $(\alpha', e)$-recon-readiness, (iii) $(\alpha', 2d)$-recon-spacing-1, and (iv) $(\alpha', e, 4d)$-configuration-viability.

Assume that $t$ and $t'$ are times. Assume that $e + 2d < t < t' - 2d$ and $\ell\mathit{time}(\alpha') + e + 6d < t'$.
Additionally assume that $t > \ell\mathit{time}(\alpha') + e + 2d$. If $\mathit{cm}$ is a mainstream CMap after $t$, then $\mathit{cm}$ is
mainstream after $t'$.

Proof. By assumption, $t > \ell\mathit{time}(\alpha') + e + 2d$. Lemma 7.11 shows that there exists a configuration,
$h$, with index $k$ with the following three properties:

1. $\mathit{members}(h) \subseteq J(t')$.

2. For all members $i$ of configuration $h$ that do not fail in $\beta(t')$, $\mathit{cm} \leq \ell\mathit{state}(\beta(t' - 2d)).\mathit{cmap}_i$.

3. No successful recon($h$, $*$) event occurs in $\beta(t' - 4d)$.

Configuration-viability guarantees that some node of configuration $h$ does not fail until after the
next configuration is installed. No successful recon($h$, $*$) event occurs in $\beta(t' - 4d)$, by Property 3.
Therefore some node, $j \in \mathit{members}(h)$, does not fail in $\beta(t')$ (and therefore does not fail in $\beta(t' - d)$),
by $4d$-configuration-viability. By Property 1 of $h$, node $j \in J(t')$. Therefore, by Lemma 7.7,
$\ell\mathit{state}(\beta(t' - 2d)).\mathit{cmap}_j$ is mainstream after $t'$.

Further, we know by Property 2 that $\mathit{cm} \leq \ell\mathit{state}(\beta(t' - 2d)).\mathit{cmap}_j$. Therefore by Lemma 7.4,
$\mathit{cm}$ is mainstream after $t'$. □

The following lemma considers the case where t < £time(a') + e + 2d: 

Lemma 7.13 Let $\alpha$ be an $\alpha'$-normal execution satisfying (i) $(\alpha', e)$-join-connectivity,
(ii) $(\alpha', e)$-recon-readiness, (iii) $(\alpha', 2d)$-recon-spacing-1, and (iv) $(\alpha', e, 4d)$-configuration-viability.

Assume that $t$ and $t'$ are times. Assume that $e + 2d < t < t' - 2d$ and $\ell\mathit{time}(\alpha') + e + 6d < t'$.
Additionally, assume that $t < \ell\mathit{time}(\alpha') + e + 2d$. If $\mathit{cm}$ is a mainstream CMap after $t$, then $\mathit{cm}$ is
mainstream after $t'$.

Figure 20: Lemma 7.13

Proof. By assumption, $t < \ell\mathit{time}(\alpha') + e + 2d$. Let $t_1 = \ell\mathit{time}(\alpha') + e + 2d$. By Lemma 7.10, $\mathit{cm}$
is mainstream after $t_1 + 2d$. By assumption, $t_1 + 2d < t' - 2d$, and $\ell\mathit{time}(\alpha') + e + 2d < t_1 + 2d$. By
Lemma 7.12, however, we know that since $\mathit{cm}$ is mainstream after $t_1 + 2d$, then $\mathit{cm}$ is mainstream
after $t'$. □

The following lemma combines the previous two lemmas into a single conclusion. This lemma is 
the main result of this section, and is used throughout the rest of the proof. 

Lemma 7.14 Let $\alpha$ be an $\alpha'$-normal execution satisfying (i) $(\alpha', e)$-join-connectivity,
(ii) $(\alpha', e)$-recon-readiness, (iii) $(\alpha', 2d)$-recon-spacing-1, and (iv) $(\alpha', e, 4d)$-configuration-viability.

Assume that $t$ and $t'$ are times. Assume that $e + 2d < t < t' - 2d$ and $\ell\mathit{time}(\alpha') + e + 6d < t'$.
If $\mathit{cm}$ is a mainstream CMap after $t$, then $\mathit{cm}$ is mainstream after $t'$.

Proof. By Lemmas 7.12 and 7.13. □ 
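As an illustration of how Lemmas 7.10, 7.12 and 7.13 chain together, the sketch below lists which lemma carries the mainstream property across which time interval. It is only a reading aid under stated assumptions: the function name `mainstream_steps` and its parameter names are hypothetical, `ltime` stands for $\ell\mathit{time}(\alpha')$, and all time bounds are treated as non-strict, which the extracted text does not settle.

```python
from typing import List, Tuple

Step = Tuple[str, float, float]  # (lemma used, mainstream after, mainstream after)


def mainstream_steps(t: float, t_prime: float, ltime: float,
                     e: float, d: float) -> List[Step]:
    """List the lemma applications that carry 'cm is mainstream' from t to t'."""
    # Hypotheses of Lemma 7.14 (non-strict bounds are an assumption of this sketch).
    if not (e + 2 * d <= t <= t_prime - 2 * d):
        raise ValueError("need e + 2d <= t <= t' - 2d")
    if not (ltime + e + 6 * d <= t_prime):
        raise ValueError("need ltime + e + 6d <= t'")

    threshold = ltime + e + 2 * d
    if t >= threshold:
        # Case (i): Lemma 7.12 applies directly.
        return [("Lemma 7.12", t, t_prime)]
    # Case (ii): Lemma 7.10 first moves the property to t1 + 2d = ltime + e + 4d,
    # which is late enough for Lemma 7.12 to finish the argument.
    t1 = threshold
    return [("Lemma 7.10", t, t1 + 2 * d), ("Lemma 7.12", t1 + 2 * d, t_prime)]
```

In case (ii) the intermediate time $t_1 + 2d$ satisfies $t_1 + 2d \le t' - 2d$ exactly because $\ell\mathit{time}(\alpha') + e + 6d \le t'$, which is why the second hypothesis of the lemma asks for $6d$.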

7.6 Upgrade-Ready Viability 

In this section, we show the relationship between a configuration being upgrade-ready, and a
configuration being viable. In particular, we prove that if an execution $\alpha$ is $(\alpha', e, 22d)$-configuration-viable,
then configuration $c(k)$ is viable until at least $15d$ after the upgrade-ready($c(k + 1)$) event.

The first lemma shows that soon after a configuration is installed, every node that joined a 
while ago learns about the new configuration. 

Lemma 7.15 Let $\alpha$ be an $\alpha'$-normal execution satisfying: (i) $(\alpha', e)$-join-connectivity,
(ii) $(\alpha', e)$-recon-readiness, (iii) $(\alpha', e, 4d)$-configuration-viability.

Assume that $t \in \mathbb{R}^{\geq 0}$ is a time, and configuration $c(k)$ is installed at time $t$. Then there exists
a CMap, $\mathit{cm}$, such that $\mathit{cm}(k) \neq \bot$, and $\mathit{cm}$ is mainstream after $\max(t, \ell\mathit{time}(\alpha') + e) + 2d$.

Proof. We first find a node $j \in \mathit{members}(c(k - 1))$ such that $j \in J(\max(t, \ell\mathit{time}(\alpha') + e) + 2d)$ and
$j$ does not fail in $\beta(\max(t, \ell\mathit{time}(\alpha') + e) + d)$. Configuration-viability guarantees that there exists
a read-quorum $R \in$ read-quorums($c(k - 1)$) and a prefix $\alpha''$ of $\alpha$ such that $c(k)$ is installed in $\alpha''$ and
no node in $R$ fails by $\max(\ell\mathit{time}(\alpha''), \ell\mathit{time}(\alpha') + e) + 4d$. Since configuration $c(k)$ is installed at
time $t$, we know that $t < \ell\mathit{time}(\alpha'')$, and therefore no node in $R$ fails by $\max(t, \ell\mathit{time}(\alpha') + e) + 4d$.
Therefore no node in $R$ fails in $\beta(\max(t, \ell\mathit{time}(\alpha') + e) + d)$. Choose some node $j \in R$.

Assume that configuration $c(k - 1)$ is proposed at time $t_{rec}$. We next apply Lemma 7.2 where
$h = c(k - 1)$, $t' = t_{rec}$, and $t = \max(t, \ell\mathit{time}(\alpha') + e) + 2d$:

• $\max(t, \ell\mathit{time}(\alpha') + e) + 2d > t_{rec}$: $c(k - 1)$ is proposed at $t_{rec} < t$, since $c(k - 1)$ must be proposed
prior to configuration $c(k - 1)$ being installed, which must occur prior to configuration $c(k)$
being installed; $t < \max(t, \ell\mathit{time}(\alpha') + e) + 2d$.

• $\max(t, \ell\mathit{time}(\alpha') + e) + 2d > \ell\mathit{time}(\alpha') + e + 2d$: Immediate.

We therefore conclude that $\mathit{members}(c(k - 1)) \subseteq J(\max(t, \ell\mathit{time}(\alpha') + e) + 2d)$. Therefore we
have shown that $j \in \mathit{members}(c(k - 1))$, $j \in J(\max(t, \ell\mathit{time}(\alpha') + e) + 2d)$, and $j$ does not fail in
$\beta(\max(t, \ell\mathit{time}(\alpha') + e) + d)$.

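Since configuration $c(k)$ is installed at time $t$ and $j \in \mathit{members}(c(k - 1))$, $\ell\mathit{state}(\beta(t)).\mathit{cmap}(k)_j \neq \bot$,
by the definition of a configuration being installed, and therefore (by Lemma 7.6)
$\ell\mathit{state}(\beta(\max(t, \ell\mathit{time}(\alpha') + e))).\mathit{cmap}(k)_j \neq \bot$. We let $\mathit{cm} = \ell\mathit{state}(\beta(\max(t, \ell\mathit{time}(\alpha') + e))).\mathit{cmap}_j$:
$\mathit{cm}(k) \neq \bot$, as required.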

We next apply Lemma 7.7, where $t = \max(t, \ell\mathit{time}(\alpha') + e)$ and $i = j$:

• $\max(t, \ell\mathit{time}(\alpha') + e) > \ell\mathit{time}(\alpha') + e$: Immediate.

• $j \in J(\max(t, \ell\mathit{time}(\alpha') + e) + 2d)$: Shown above.

• $j$ does not fail in $\beta(\max(t, \ell\mathit{time}(\alpha') + e) + d)$: Shown above.

We therefore conclude that $\ell\mathit{state}(\beta(\max(t, \ell\mathit{time}(\alpha') + e))).\mathit{cmap}_j$ is mainstream after
$\max(t, \ell\mathit{time}(\alpha') + e) + 2d$, that is, $\mathit{cm}$ is mainstream after $\max(t, \ell\mathit{time}(\alpha') + e) + 2d$. □

The next lemma shows that soon after smaller configurations are installed, a configuration is 
upgrade-ready. 

Lemma 7.16 Let $\alpha$ be an $\alpha'$-normal execution satisfying: (i) $(\alpha', e)$-join-connectivity,
(ii) $(\alpha', e)$-recon-readiness, (iii) $(\alpha', 2d)$-recon-spacing-1, and (iv) $(\alpha', e, 4d)$-configuration-viability.

Let $c$ be a configuration with index $k$, and assume that for all $\ell < k$, configuration $c(\ell)$ is
installed in $\alpha$ by time $t$.

Then upgrade-ready($k$) occurs in $\beta(\max(t, \ell\mathit{time}(\alpha') + e) + 6d)$.

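Proof. For every configuration $c(\ell)$ with index $\ell < k$, let $t_\ell$ be the time at which configuration
$c(\ell)$ is installed. Therefore $t > \max_\ell(t_\ell)$.

We first show that for all $\ell < k$, there exists a CMap, $\mathit{cm}_\ell$, such that $\mathit{cm}_\ell(\ell) \neq \bot$ and $\mathit{cm}_\ell$ is
mainstream after $\max(t, \ell\mathit{time}(\alpha') + e) + 6d$. Fix some $\ell < k$.

Lemma 7.15, where $t = t_\ell$ and $k = \ell$, shows that there exists a CMap, $\mathit{cm}_\ell$, such that $\mathit{cm}_\ell(\ell) \neq \bot$
and $\mathit{cm}_\ell$ is mainstream after time $\max(t_\ell, \ell\mathit{time}(\alpha') + e) + 2d$.

We next apply Lemma 7.14, where $t = \max(t_\ell, \ell\mathit{time}(\alpha') + e) + 2d$ and
$t' = \max(t, \ell\mathit{time}(\alpha') + e) + 6d$:

• $\max(t_\ell, \ell\mathit{time}(\alpha') + e) + 2d > e + 2d$: Immediate.

• $\max(t_\ell, \ell\mathit{time}(\alpha') + e) + 2d < \max(t, \ell\mathit{time}(\alpha') + e) + 6d - 2d$: We know that $t_\ell < t$, and
$\ell\mathit{time}(\alpha') + e + 2d < \ell\mathit{time}(\alpha') + e + 4d$.

• $\max(t, \ell\mathit{time}(\alpha') + e) + 6d > \ell\mathit{time}(\alpha') + e + 6d$: Immediate.

• $\mathit{cm}_\ell$ is mainstream after $\max(t_\ell, \ell\mathit{time}(\alpha') + e) + 2d$: Shown above.

We therefore conclude that $\mathit{cm}_\ell$ is mainstream after $\max(t, \ell\mathit{time}(\alpha') + e) + 6d$. We have thus shown
that for all $\ell < k$, there exists a CMap, $\mathit{cm}_\ell$, such that $\mathit{cm}_\ell(\ell) \neq \bot$ and $\mathit{cm}_\ell$ is mainstream after
$\max(t, \ell\mathit{time}(\alpha') + e) + 6d$.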

Recall that upgrade-ready($k$) is designated as the first event after which (i) all configurations
with index $< k$ have been installed, and (ii) for all $\ell < k$, for all members of configuration $c(k - 1)$
that do not fail prior to the upgrade event, $\mathit{cmap}(\ell) \neq \bot$. The first component occurs by time $t$,
and therefore by time $\max(t, \ell\mathit{time}(\alpha') + e) + 6d$, by assumption.

We therefore need to show the second part. Fix some node $j \in \mathit{members}(c(k - 1))$ such that
$j$ does not fail in $\beta(\max(t, \ell\mathit{time}(\alpha') + e) + 6d)$. Fix some $\ell < k$. We apply Lemma 7.2, where
$h = c(k - 1)$, $t = \max(t, \ell\mathit{time}(\alpha') + e) + 6d$, and $t'$ is the time at which configuration $c(k - 1)$ is
proposed:

• $\max(t, \ell\mathit{time}(\alpha') + e) + 6d$ is $>$ the time at which configuration $c(k - 1)$ is proposed: $c(k - 1)$
is proposed prior to time $t_{k-1}$ (the time at which configuration $c(k - 1)$ is installed), which
is $<$ time $t < \max(t, \ell\mathit{time}(\alpha') + e) + 6d$.

• $\max(t, \ell\mathit{time}(\alpha') + e) + 6d > \ell\mathit{time}(\alpha') + e + 2d$: Immediate.

We therefore conclude that $\mathit{members}(c(k - 1)) \subseteq J(\max(t, \ell\mathit{time}(\alpha') + e) + 6d)$, and therefore
$j \in J(\max(t, \ell\mathit{time}(\alpha') + e) + 6d)$.

We know from above that $\mathit{cm}_\ell$ is mainstream after $\max(t, \ell\mathit{time}(\alpha') + e) + 6d$, which implies,
by the definition of being mainstream, that $\mathit{cm}_\ell \leq \ell\mathit{state}(\beta(\max(t, \ell\mathit{time}(\alpha') + e) + 6d)).\mathit{cmap}_j$.
This in turn implies that $\ell\mathit{state}(\beta(\max(t, \ell\mathit{time}(\alpha') + e) + 6d)).\mathit{cmap}(\ell)_j \neq \bot$, as required. Therefore
upgrade-ready($k$) occurs in $\beta(\max(t, \ell\mathit{time}(\alpha') + e) + 6d)$. □

The next lemma directly relates the time when all quorums of configuration $c(k - 1)$ fail to the
time at which upgrade-ready($k$) occurs.

Lemma 7.17 Let $\alpha$ be an $\alpha'$-normal execution satisfying: (i) $(\alpha', e)$-join-connectivity,
(ii) $(\alpha', e)$-recon-readiness, (iii) $(\alpha', 2d)$-recon-spacing-1, and (iv) $(\alpha', e, 22d)$-configuration-viability.

Let $c$ be a configuration with index $k$, and assume that the upgrade-ready($k$) event occurs at time
$t$. Then there exists a read-quorum, $R$, and a write-quorum, $W$, of configuration $c(k - 1)$ such that
no node in $R \cup W$ fails in $\beta(\max(t, \ell\mathit{time}(\alpha') + e) + 16d)$.

Proof. Let $\alpha''$ be the shortest prefix of $\alpha$ such that every configuration with index $< k$ is installed
in $\alpha''$. Let $t' = \ell\mathit{time}(\alpha'')$. Notice that for all $\ell < k$, configuration $c(\ell)$ is installed in $\beta(t')$.

Lemma 7.16, where $t = t'$ and $c$ and $k$ are as defined above, shows that the upgrade-ready($k$)
event occurs in $\beta(\max(t', \ell\mathit{time}(\alpha') + e) + 6d)$, that is, $t < \max(t', \ell\mathit{time}(\alpha') + e) + 6d$.

Configuration-viability guarantees that there exists a read-quorum, $R$, and a write-quorum, $W$,
of configuration $c(k - 1)$ such that either (1) no process in $R \cup W$ fails in $\alpha$, or (2) there exists
a finite prefix, $\alpha_{install}$, of $\alpha$ such that for all $\ell < k$, configuration $c(\ell)$ is installed in $\alpha_{install}$ and
no process in $R \cup W$ fails in $\alpha$ by time $\max(\ell\mathit{time}(\alpha_{install}), \ell\mathit{time}(\alpha') + e) + 22d$. In the former
case, we are done. We now consider the second case. Since $\alpha''$ is the shortest prefix of $\alpha$ such
that every configuration with index $< k$ is installed, we know that $\alpha''$ is a prefix of $\alpha_{install}$, and
therefore $t' = \ell\mathit{time}(\alpha'') < \ell\mathit{time}(\alpha_{install})$. Therefore we know that there exists a read-quorum,
$R \in$ read-quorums($c(k - 1)$), and a write-quorum, $W \in$ write-quorums($c(k - 1)$), such that no node
in $R \cup W$ fails by time $\max(t', \ell\mathit{time}(\alpha') + e) + 22d$.

Then, $\max(t, \ell\mathit{time}(\alpha') + e) + 16d < \max(t', \ell\mathit{time}(\alpha') + e) + 22d$, and as a result, no node in $R \cup W$
fails by time $\max(t, \ell\mathit{time}(\alpha') + e) + 16d$. That is, no node in $R \cup W$ fails in
$\beta(\max(t, \ell\mathit{time}(\alpha') + e) + 16d)$. □
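The last step of the proof of Lemma 7.17 is pure arithmetic: upgrade-ready($k$) happens at most $6d$ after installation, so a viability window of $22d$ measured from installation leaves at least $16d$ measured from the upgrade-ready time. The sketch below checks that inequality on integer samples. The function and parameter names are hypothetical (`t_install` stands for $t' = \ell\mathit{time}(\alpha'')$ and `ltime` for $\ell\mathit{time}(\alpha')$), and the bounds are taken as non-strict, which is an assumption.

```python
import random


def viability_bound_holds(t: int, t_install: int, ltime: int, e: int, d: int) -> bool:
    """Check max(t, ltime+e) + 16d <= max(t_install, ltime+e) + 22d.

    Requires the conclusion of Lemma 7.16: t <= max(t_install, ltime+e) + 6d.
    """
    base = ltime + e
    if t > max(t_install, base) + 6 * d:
        raise ValueError("upgrade-ready must occur within 6d of installation")
    return max(t, base) + 16 * d <= max(t_install, base) + 22 * d


def check_random(trials: int = 10_000, seed: int = 0) -> bool:
    """Try the bound on random integer inputs; integers avoid float rounding."""
    rng = random.Random(seed)
    for _ in range(trials):
        d = rng.randint(0, 10)
        ltime = rng.randint(0, 50)
        e = rng.randint(0, 10)
        t_install = rng.randint(0, 100)
        t = rng.randint(0, max(t_install, ltime + e) + 6 * d)
        if not viability_bound_holds(t, t_install, ltime, e, d):
            return False
    return True
```

The bound is tight when the upgrade-ready event occurs exactly $6d$ after $\max(t', \ell\mathit{time}(\alpha') + e)$, which is the case the $22d = 6d + 16d$ split is designed for.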

The final lemma shows that if no upgrade-ready($k$) occurs in $\alpha$, then configuration $c(k - 1)$ is always
viable.



Lemma 7.18 Let $\alpha$ be an $\alpha'$-normal execution satisfying: (i) $(\alpha', e)$-join-connectivity,
(ii) $(\alpha', e)$-recon-readiness, (iii) $(\alpha', 2d)$-recon-spacing-1, and (iv) $(\alpha', e, 4d)$-configuration-viability.

Let $c$ be a configuration with index $k$, and assume that no upgrade-ready($k + 1$) event occurs
in $\alpha$. Then there exists a read-quorum, $R \in$ read-quorums($c$), and a write-quorum,
$W \in$ write-quorums($c$), such that no node in $R \cup W$ fails in $\alpha$.

Proof. Assume that for some $\ell < k + 1$, configuration $c(\ell)$ is not installed in $\alpha$. By the definition
of configuration-viability, then, there exists a read-quorum, $R \in$ read-quorums($c$), and a
write-quorum, $W \in$ write-quorums($c$), such that no node in $R \cup W$ fails in $\alpha$.

Assume, instead, that for every $\ell < k + 1$, configuration $c(\ell)$ is installed in $\alpha$. Then by
Lemma 7.16, an upgrade-ready($k + 1$) event occurs in $\alpha$, contradicting the hypothesis. □

7.7 Configuration-Upgrade Latency Results 

In this section we show that configuration-upgrade operations terminate rapidly, and that any
obsolete configuration is rapidly removed. In particular, these results hold in executions that include
periods of bad behavior. The configuration-upgrade mechanism in Rambo does not make these
guarantees. The original Rambo latency analysis required the assumption of
$(\alpha', \infty)$-configuration-viability$^6$ for the entire execution. This is an unrealistic assumption in a
long-lived dynamic system. As a result of the new configuration-upgrade mechanism, we need to assume
only bounded configuration-viability to ensure liveness.

First we state a lemma about configuration-upgrade after the system stabilizes and good be- 
havior resumes. 

Lemma 7.19 Let $\alpha$ be an $\alpha'$-normal execution. Let $t \in \mathbb{R}^{\geq 0}$ be a time. Let $i$ be a node that does
not fail until after $\max(t, \ell\mathit{time}(\alpha') + d) + 4d$.

Assume a cfg-upgrade($k$)$_i$ event occurs in $\alpha$ at time $t$. Additionally, assume that for every
configuration $c(\ell)$ such that $\mathit{upg.cmap}(\ell)_i \in C$, there exists a read-quorum, $R_\ell$, and a write-quorum,
$W_\ell$, of configuration $c(\ell)$ such that no node in $R_\ell \cup W_\ell$ fails by time $t + 3d$.

Then a cfg-upgrade-ack($k$)$_i$ event occurs no later than $t + 4d$.

Proof. There are two cases to consider.

Case 1: $t > \ell\mathit{time}(\alpha')$. At time $t$, node $i$ begins the configuration-upgrade, with phase-number
$p_1 = \mathit{upg.pnum}_i$. By triggered gossip, node $i$ immediately sends out messages to every node
in $\mathit{world}_i$. Therefore for every configuration $c(\ell)$ such that $\mathit{upg.cmap}(\ell)_i \in C$, every node
$j \in R_\ell \cup W_\ell$ receives a message by time $t + d$.

By triggered gossip, then, each of these nodes sends a response with phase-number $p_1$. Each
response is received by time $t + 2d$, at which point a cfg-upg-query-fix($k$)$_i$ event occurs. Node
$i$ then chooses a new phase-number, $p_2$, and sets $\mathit{upg.pnum}_i = p_2$.

Immediately, by triggered gossip node $i$ sends out messages to every process in $\mathit{world}_i$, including
every node in $R_\ell \cup W_\ell$, for every configuration $c(\ell)$ such that $\mathit{upg.cmap}(\ell)_i \in C$. Again, a
response is sent by time $t + 3d$, and node $i$ receives a response from each with phase-number
$p_2$ by time $t + 4d$. Immediately, then, a cfg-upg-query-fix($k$) event occurs. This is followed by
a cfg-upgrade-ack($k$), proving our claim.



6 Although we have not formally defined $(\alpha', \infty)$-configuration-viability here, one can understand it to mean
$(\alpha', e)$-configuration-viability for arbitrarily large $e$.
Case 2: $t < \ell\mathit{time}(\alpha')$. At time $t$, node $i$ begins the configuration-upgrade, with phase-number
$p_1 = \mathit{upg.pnum}_i$. By occasional gossip, $i$ sends out messages to every node in $\mathit{world}_i$. Therefore
for every configuration $c(\ell)$ such that $\mathit{upg.cmap}(\ell)_i \in C$, every node $j \in R_\ell \cup W_\ell$ receives
a message by time $\max(t, \ell\mathit{time}(\alpha') + d) + d$.

By triggered gossip, then, each of these nodes sends a response with phase-number $p_1$. Each
response is received by time $\max(t, \ell\mathit{time}(\alpha') + d) + 2d$, at which point a cfg-upg-query-fix($k$)$_i$
event occurs. Node $i$ then chooses a new phase-number, $p_2$, and sets $\mathit{upg.pnum}_i = p_2$.

Immediately, by triggered gossip node $i$ sends out messages to every process in $\mathit{world}_i$, including
every node in $R_\ell \cup W_\ell$, for every configuration $c(\ell)$ such that $\mathit{upg.cmap}(\ell)_i \in C$. Again, a
response is sent by time $\max(t, \ell\mathit{time}(\alpha') + d) + 3d$, and node $i$ receives a response from each with
phase-number $p_2$ by time $\max(t, \ell\mathit{time}(\alpha') + d) + 4d$. Immediately, then, a cfg-upg-query-fix($k$)
event occurs. This is followed by a cfg-upgrade-ack($k$), proving our claim.

□
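The two cases in the proof of Lemma 7.19 follow the same four message delays and differ only in when the first round can be counted from. The sketch below tabulates those deadlines. It is illustrative only: the function name `upgrade_timeline`, its dictionary keys, and the use of `ltime` for $\ell\mathit{time}(\alpha')$ are hypothetical, and the boundary $t = \ell\mathit{time}(\alpha')$ is assigned to Case 1 as an assumption.

```python
from typing import Dict


def upgrade_timeline(t: float, ltime: float, d: float) -> Dict[str, float]:
    """Deadlines for the two gossip rounds of a configuration upgrade started at t."""
    # Case 1 (t >= ltime): triggered gossip is reliable, so rounds count from t.
    # Case 2 (t < ltime): only occasional gossip is available at first, and the
    # proof counts the rounds from max(t, ltime + d) instead.
    start = t if t >= ltime else max(t, ltime + d)
    return {
        "first_round_delivered": start + d,       # quorum members receive phase p1
        "first_round_answered": start + 2 * d,    # cfg-upg-query-fix occurs
        "second_round_answers_sent": start + 3 * d,
        "second_round_answered": start + 4 * d,   # followed by cfg-upgrade-ack
    }
```

For example, with `d = 1` an upgrade started at `t = 10` after stabilization at `ltime = 5` is acknowledged by time 14, while one started at `t = 3` before stabilization is acknowledged by time 10.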

Next, we provide a conditional guarantee that a configuration is viable: if for some time t every 
earlier cfg-upgrade operation completes rapidly within 4d, then every configuration that is extant 
at time t will remain viable until t + 3d. 

We do this in four steps. First, Lemma 7.20 demonstrates that a node with certain good
properties exists. Second, Lemma 7.21 shows that this node with good properties will
begin an upgrade operation, in certain situations. Third, Lemma 7.22 shows that soon after a
configuration is upgrade-ready($k$), some node completes an upgrade operation on configuration
$c(k)$. Finally, Lemma 7.23 uses these preliminary lemmas to show that under certain conditions,
configurations remain viable sufficiently long.

Lemma 7.20 Let $\alpha$ be an $\alpha'$-normal execution satisfying (i) $(\alpha', e)$-join-connectivity,
(ii) $(\alpha', e)$-recon-readiness, (iii) $(\alpha', e)$-upgrade-readiness, (iv) $(\alpha', 2d)$-recon-spacing-1,
(v) $(\alpha', e, 22d)$-configuration-viability.

Assume that an upgrade-ready($k_2$) event occurs at time $t$ for some configuration $c_2$ and assume
that $k_2 > 1$. Let $k_1 = k_2 - 1$, and $c_1 = c(k_1)$. Then there exists a node $i$ such that the following
hold:

1. $i$ is a member of configuration $c_1$,

2. $i$ does not fail in $\beta(\max(t, \ell\mathit{time}(\alpha') + e + d) + 10d)$,

3. $i \in J(\max(t, \ell\mathit{time}(\alpha') + e + d) + 8d)$,

4. $i \in J(\max(t, \ell\mathit{time}(\alpha') + e + 2d))$,

5. $i$ performs a join-ack prior to the upgrade-ready($k_2$) event in $\alpha$.

Proof. Lemma 7.17, applied with $c = c_2$, $k = k_2$, and $t$ as defined above, implies that there exists
a read-quorum, $R$, of configuration $c_1$ such that no member of $R$ fails in
$\beta(\max(t, \ell\mathit{time}(\alpha') + e) + 16d)$. Then we know that no member of $R$ fails in
$\beta(\max(t, \ell\mathit{time}(\alpha') + e + d) + 14d)$. We therefore choose a node $i \in R \subseteq \mathit{members}(c_1)$. We know that
$i$ does not fail in $\beta(\max(t, \ell\mathit{time}(\alpha') + e + d) + 10d)$. This $i$ satisfies Parts 1 and 2.

Let $t_{c_1}$ be the time at which configuration $c_1$ is proposed. Notice that
$\max(t, \ell\mathit{time}(\alpha') + e + 2d) > t_{c_1}$, because $t$, the time of the upgrade-ready($k_2$), cannot be smaller than
$t_{c_1}$, the time at which configuration $c_1$ is proposed (since an upgrade-ready($k_2$) event cannot occur until
after a recon($c_1$, $c_2$) event, which cannot occur until after a recon($*$, $c_1$) event). Therefore, Lemma 7.2,
applied where $h = c_1$, $t' = t_{c_1}$, and $t = \max(t, \ell\mathit{time}(\alpha') + e + 2d)$, guarantees that
$\mathit{members}(c_1) \subseteq J(\max(t, \ell\mathit{time}(\alpha') + e + 2d))$. Since $i \in \mathit{members}(c_1)$, we know that
$i \in J(\max(t, \ell\mathit{time}(\alpha') + e + 2d))$, satisfying Part 4.

Since $\max(t, \ell\mathit{time}(\alpha') + e + 2d) < \max(t, \ell\mathit{time}(\alpha') + e + d) + 10d$ (since
$\ell\mathit{time}(\alpha') + e + 2d < \ell\mathit{time}(\alpha') + e + 10d$), Lemma 7.1, applied where $t = \max(t, \ell\mathit{time}(\alpha') + e + 2d)$ and
$t' = \max(t, \ell\mathit{time}(\alpha') + e + d) + 10d$, implies that
$J(\max(t, \ell\mathit{time}(\alpha') + e + 2d)) \subseteq J(\max(t, \ell\mathit{time}(\alpha') + e + d) + 10d)$, and thus
$i \in J(\max(t, \ell\mathit{time}(\alpha') + e + d) + 10d)$, satisfying Part 3.

Finally, notice that recon-readiness requires that $i$ performs a join-ack prior to the recon($*$, $c_1$)
event, and therefore prior to the cfg-upgrade($k_2$) event. This satisfies Part 5. □

The next lemma claims that when a configuration is upgrade-ready, and a node with certain 
properties (as in Lemma 7.20) exists, then either the configuration is removed or an upgrade 
operation begins. 

Lemma 7.21 Let $\alpha$ be an $\alpha'$-normal execution satisfying (i) $(\alpha', e)$-join-connectivity,
(ii) $(\alpha', e)$-recon-readiness, (iii) $(\alpha', e)$-upgrade-readiness, (iv) $(\alpha', 2d)$-recon-spacing-1,
(v) $(\alpha', e, 22d)$-configuration-viability.

Assume upgrade-ready($k_2$) occurs at time $t$ and $k_2 > 1$. Let $k_1 = k_2 - 1$ and $c_1 = c(k_1)$.
Further, assume that node $i$ has the following properties:

1. $i$ is a member of configuration $c_1$,

2. $i$ does not fail in $\beta(\max(t, \ell\mathit{time}(\alpha') + e + d) + 10d)$,

3. $i \in J(\max(t, \ell\mathit{time}(\alpha') + e + d) + 8d)$,

4. $i \in J(\max(t, \ell\mathit{time}(\alpha') + e + 2d))$,

5. $i$ performs a join-ack prior to the upgrade-ready($k_2$) event.

Let $t'$ be a time such that $t < t' < \max(t, \ell\mathit{time}(\alpha') + e + d) + 13d$. Let $\alpha''$ be a prefix of $\alpha$ such
that:

1. $t' = \ell\mathit{time}(\alpha'')$,

2. an upgrade-ready($k_2$) event is in $\alpha''$,

3. $\ell\mathit{state}(\alpha'').\mathit{upg.phase}_i = \mathit{idle}$.

Then either:

1. $\ell\mathit{state}(\beta(t')).\mathit{cmap}(k_1)_i = \pm$, or

2. $i$ performs a cfg-upgrade($k'$)$_i$ at time $t'$, for some $k' > k_2$.

Proof. If $\ell\mathit{state}(\alpha'').\mathit{cmap}(k_1)_i = \pm$, then the conclusion holds, since $\alpha''$ is a prefix of $\beta(t')$:
by Lemma 7.6, $\ell\mathit{state}(\beta(t')).\mathit{cmap}(k_1)_i = \pm$. Assume, then, that $\ell\mathit{state}(\alpha'').\mathit{cmap}(k_1)_i \neq \pm$. We
examine in turn the preconditions for cfg-upgrade($k'$)$_i$ just after $\alpha''$ (from Figure 7):

1. $\neg\,\ell\mathit{state}(\alpha'').\mathit{failed}_i$: By Part 2 of the assumption on $i$, we know that $i$ does not fail in
$\beta(\max(t, \ell\mathit{time}(\alpha') + e + d) + 10d)$. However, $t' < \max(t, \ell\mathit{time}(\alpha') + e + d) + 10d$, and
thus $i$ does not fail in $\beta(t')$. Since $\alpha''$ is a prefix of $\beta(t')$, $i$ does not fail in $\alpha''$.

2. $\ell\mathit{state}(\alpha'').\mathit{status}_i = \mathit{active}$: By Part 5 of the assumption on $i$ we know that $i$ performs a
join-ack prior to the upgrade-ready($k_2$) event.

3. $\ell\mathit{state}(\alpha'').\mathit{upg.phase}_i = \mathit{idle}$: By assumption, this holds.

4. $\forall \ell \in \mathbb{N}$, $\ell < k_2$: $\ell\mathit{state}(\alpha'').\mathit{cmap}(\ell)_i \neq \bot$: It suffices to show that by the point in the execution
at which the upgrade-ready($k_2$) event occurs, node $i$ has already learned of configuration
$c_2$ and all configurations with smaller indices. Let $\alpha'''$ be the prefix of $\alpha$ ending in the
upgrade-ready($k_2$) event. Part (ii) of the definition of the upgrade-ready($k_2$) event guarantees
that: for all $\ell < k_2$, for all $j \in \mathit{members}(c_1)$ that do not fail in $\alpha'''$, $\ell\mathit{state}(\alpha''').\mathit{cmap}(\ell)_j \neq \bot$.
Notice that by Part 1 of the assumption about $i$, $i \in \mathit{members}(c_1)$ and that by Part 2 of the
assumption about $i$, $i$ does not fail in $\alpha'''$, since $\ell\mathit{time}(\alpha''') = t < \max(t, \ell\mathit{time}(\alpha') + e + d)$.
Therefore we can conclude by part (ii) that for all $\ell < k_2$, $\ell\mathit{state}(\alpha''').\mathit{cmap}(\ell)_i \neq \bot$. Since
$\alpha'''$ is a prefix of $\alpha''$ (by assumption that upgrade-ready($k_2$) is included in $\alpha''$), by Lemma 7.5
we know that for all $\ell < k_2$, $\ell\mathit{state}(\alpha'').\mathit{cmap}(\ell)_i \neq \bot$, as desired.

5. $\ell\mathit{state}(\alpha'').\mathit{cmap}(k_2)_i \in C$: By assumption, $\ell\mathit{state}(\alpha'').\mathit{cmap}(k_1)_i \neq \pm$. Invariant 4.3 then
implies that $\ell\mathit{state}(\alpha'').\mathit{cmap}(k_2)_i \neq \pm$, since $k_1 < k_2$. Part 4, above, shows that
$\ell\mathit{state}(\alpha'').\mathit{cmap}(k_2)_i \neq \bot$, thus implying the desired result.

6. $\ell\mathit{state}(\alpha'').\mathit{cmap}(k_1)_i \in C$: By assumption, $\ell\mathit{state}(\alpha'').\mathit{cmap}(k_1)_i \neq \pm$. Part 4, above, shows
that $\ell\mathit{state}(\alpha'').\mathit{cmap}(k_1)_i \neq \bot$, since $k_1 < k_2$, thus implying the desired result.

Since enabled events occur in zero time (by assumption), either the event becomes disabled, in which
case $\ell\mathit{state}(\beta(t')).\mathit{cmap}(k_1)_i = \pm$, satisfying Part 1 of the conclusion, or at time $t' = \ell\mathit{time}(\alpha'')$ a
cfg-upgrade event for some configuration $c$ with index $k' > k_2$ occurs, satisfying Part 2 of the
conclusion. □
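The six preconditions examined in the proof of Lemma 7.21 can be read as a single enabling predicate on the local state of node $i$. The sketch below is a hypothetical rendering of that checklist, not the action definition of Figure 7: the `NodeState` class, its field names, and the string sentinels standing for $\bot$ and $\pm$ are all assumptions made for illustration. The index bound in precondition 4 is taken as inclusive of $k_2$, since precondition 5 appeals to Part 4 for index $k_2$ itself.

```python
from dataclasses import dataclass, field
from typing import Dict

BOTTOM = "bottom"    # stands for the undefined entry: configuration not yet known
REMOVED = "removed"  # stands for the removed entry: configuration already upgraded away


@dataclass
class NodeState:
    """Hypothetical slice of a node's local state relevant to cfg-upgrade."""
    failed: bool = False
    status: str = "idle"
    upg_phase: str = "idle"
    cmap: Dict[int, str] = field(default_factory=dict)

    def entry(self, index: int) -> str:
        # Indices with no recorded value are treated as undefined.
        return self.cmap.get(index, BOTTOM)


def is_config(value: str) -> bool:
    """True when a cmap entry is an actual configuration (an element of C)."""
    return value not in (BOTTOM, REMOVED)


def cfg_upgrade_enabled(s: NodeState, k2: int) -> bool:
    """Check the six preconditions from the proof for target index k2."""
    k1 = k2 - 1
    return (
        not s.failed                                               # 1. node has not failed
        and s.status == "active"                                   # 2. node has joined
        and s.upg_phase == "idle"                                  # 3. no upgrade in progress
        and all(s.entry(l) != BOTTOM for l in range(k2 + 1))       # 4. no gaps up to k2
        and is_config(s.entry(k2))                                 # 5. cmap(k2) is in C
        and is_config(s.entry(k1))                                 # 6. cmap(k1) is in C
    )
```

Once `cmap[k1]` has been set to the removed marker, precondition 6 fails and the predicate is false, which corresponds to the first alternative in the conclusion of the lemma.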

The next lemma conditionally guarantees that soon after a new configuration is upgrade-ready, the 
old configuration is removed. 

Lemma 7.22 Let $\alpha$ be an $\alpha'$-normal execution satisfying (i) $(\alpha', e)$-join-connectivity,
(ii) $(\alpha', e)$-recon-readiness, (iii) $(\alpha', e)$-upgrade-readiness, (iv) $(\alpha', 2d)$-recon-spacing-1,
(v) $(\alpha', e, 22d)$-configuration-viability.

Assume that $t \in \mathbb{R}^{\geq 0}$ is a time such that $t > \ell\mathit{time}(\alpha') + e + 14d$. Assume that $c_1$ is a
configuration, and for some finite prefix $\alpha''$ of $\alpha$, where $t = \ell\mathit{time}(\alpha'')$, for some node $i \in J(t)$ that
does not fail in $\alpha''$, for some index $k_1$, $\ell\mathit{state}(\alpha'').\mathit{cmap}(k_1)_i = c_1$.

Also, we assume the Upgrades-Complete Hypothesis: for every cfg-upgrade($*$)$_j$ event that occurs
in $\alpha$ at some time $t_{upg} < t$ at some node $j \in J(\max(t_{upg}, \ell\mathit{time}(\alpha') + e + 2d))$ where $j$ does not fail in
$\beta(\max(t_{upg}, \ell\mathit{time}(\alpha') + e + d) + 4d)$, a matching cfg-upg-ack($*$)$_j$ occurs by time
$\max(t_{upg}, \ell\mathit{time}(\alpha') + e + d) + 4d$.

Assume that an upgrade-ready($k_1 + 1$) event occurs at time $t' < t - 13d$. Let $k_2 = k_1 + 1$
and $c_2 = c(k_2)$. Then for some node $i' \in J(\max(t', \ell\mathit{time}(\alpha') + e + d) + 8d)$ that does not fail in
$\beta(\max(t', \ell\mathit{time}(\alpha') + e + d) + 10d)$, $\ell\mathit{state}(\beta(\max(t', \ell\mathit{time}(\alpha') + e + d) + 8d)).\mathit{cmap}(k_1)_{i'} = \pm$.

Proof. We first identify a node, $i'$, that is suitable. Then we show that $i'$ completes an upgrade
operation in the allotted time.

We apply Lemma 7.20, where $t = t'$, and therefore conclude that there exists a node $i'$ with the
following five properties:

1. $i'$ is a member of configuration $c_1$,

2. $i'$ does not fail in $\beta(\max(t', \ell\mathit{time}(\alpha') + e + d) + 10d)$,

3. $i' \in J(\max(t', \ell\mathit{time}(\alpha') + e + d) + 8d)$,

4. $i' \in J(\max(t', \ell\mathit{time}(\alpha') + e + 2d))$,

5. $i'$ performs a join-ack prior to the upgrade-ready($k_2$) event.

Notice that Part 2 and Part 3 satisfy the first two requirements for $i'$ in the conclusion of this
lemma. It remains to show that $i'$ marks configuration $c_1$ as $\pm$ at the appropriate point.

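We consider what happens at time $\max(t', \ell\mathit{time}(\alpha') + e + d)$. Let $\alpha'''$ be the prefix of $\alpha$ that
is the longer of the following two prefixes: (i) $\beta(\ell\mathit{time}(\alpha') + e + d)$, or (ii) the shortest prefix of $\alpha$
that includes the cfg-upgrade($k_2$) event. Notice that $\ell\mathit{time}(\alpha''') = \max(t', \ell\mathit{time}(\alpha') + e + d)$, and
that the cfg-upgrade($k_2$) event is in $\alpha'''$.

If $\ell\mathit{state}(\alpha''').\mathit{cmap}(k_1)_{i'} = \pm$, then the claim is immediate: Lemma 7.5 implies that
$\ell\mathit{state}(\alpha''').\mathit{cmap}_{i'} \leq \ell\mathit{state}(\beta(\max(t', \ell\mathit{time}(\alpha') + e + d) + 8d)).\mathit{cmap}_{i'}$, since
$\ell\mathit{time}(\alpha''') = \max(t', \ell\mathit{time}(\alpha') + e + d) < \max(t', \ell\mathit{time}(\alpha') + e + d) + 8d$. Therefore, if
$\ell\mathit{state}(\alpha''').\mathit{cmap}(k_1)_{i'} = \pm$, then $\ell\mathit{state}(\beta(\max(t', \ell\mathit{time}(\alpha') + e + d) + 8d)).\mathit{cmap}(k_1)_{i'} = \pm$.

We thus assume that $\ell\mathit{state}(\alpha''').\mathit{cmap}(k_1)_{i'} \neq \pm$, and consider what happens at time
$\max(t', \ell\mathit{time}(\alpha') + e + d)$. There are now two cases to consider:

1. $\ell\mathit{state}(\alpha''').\mathit{upg.phase}_{i'} = \mathit{idle}$, or

2. $\ell\mathit{state}(\alpha''').\mathit{upg.phase}_{i'} \neq \mathit{idle}$.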

Case 1: Assume that $\ell\mathit{state}(\alpha''').\mathit{upg.phase}_{i'} = \mathit{idle}$. We apply Lemma 7.21, where $t = t'$,
$t' = \max(t', \ell\mathit{time}(\alpha') + e + d)$, $\alpha'' = \alpha'''$, and $i'$ is as chosen above:

• $t' < \max(t', \ell\mathit{time}(\alpha') + e + d) < \max(t', \ell\mathit{time}(\alpha') + e + d) + 13d$: immediate,

• $i'$ satisfies the criteria, by the properties of $i'$ above,

• $\ell\mathit{time}(\alpha''') = \max(t', \ell\mathit{time}(\alpha') + e + d)$ and upgrade-ready($k_2$) occurs in $\alpha'''$: by the way
in which $\alpha'''$ was chosen,

• $\ell\mathit{state}(\alpha''').\mathit{upg.phase}_{i'} = \mathit{idle}$: by the case assumption.

From this lemma, we conclude that either:

1. $\ell\mathit{state}(\beta(\max(t', \ell\mathit{time}(\alpha') + e + d))).\mathit{cmap}(k_1)_{i'} = \pm$, or

2. $i'$ performs a cfg-upgrade($k'$)$_{i'}$ at time $\max(t', \ell\mathit{time}(\alpha') + e + d)$, for some $k' > k_2$.

In the first case, where $\ell\mathit{state}(\beta(\max(t', \ell\mathit{time}(\alpha') + e + d))).\mathit{cmap}(k_1)_{i'} = \pm$, we are done:
Lemma 7.6 implies that $\ell\mathit{state}(\beta(\max(t', \ell\mathit{time}(\alpha') + e + d) + 8d)).\mathit{cmap}(k_1)_{i'} = \pm$. Consider
the second case, that is, $i'$ performs a cfg-upgrade($k'$)$_{i'}$ at time $\max(t', \ell\mathit{time}(\alpha') + e + d)$, for
some $k' > k_2$.

We then apply the Upgrades-Complete Hypothesis, where $j = i'$ and $t_{upg} = t'$; notice that:

• $i' \in J(\max(t', \ell\mathit{time}(\alpha') + e + 2d))$: by the 4th property of $i'$,

• $i'$ does not fail in $\beta(\max(t', \ell\mathit{time}(\alpha') + e + d) + 4d)$: by Part 2 of the way in which $i'$
was chosen, and

• $\max(t', \ell\mathit{time}(\alpha') + e + d) < t$: $t' + 13d < t$, by assumption, and $\ell\mathit{time}(\alpha') + e + 14d < t$,
by assumption, and therefore $\max(t', \ell\mathit{time}(\alpha') + e + d) + 13d < t$.

Therefore, by the Upgrades-Complete Hypothesis we conclude that a cfg-upg-ack($k'$)$_{i'}$ occurs
by time $\max(t', \ell\mathit{time}(\alpha') + e + d) + 4d$. Since $k' > k_2$, then by the precondition of a cfg-upg-ack
operation we know that $\ell\mathit{state}(\beta(\max(t', \ell\mathit{time}(\alpha') + e + d) + 4d)).\mathit{cmap}(k_1)_{i'} = \pm$. Lemma 7.6
implies that $\ell\mathit{state}(\beta(\max(t', \ell\mathit{time}(\alpha') + e + d) + 8d)).\mathit{cmap}(k_1)_{i'} = \pm$, as desired.

Case 2: Assume that $\ell\mathit{state}(\alpha''').\mathit{upg.phase}_{i'} \neq \mathit{idle}$. For this to occur, a cfg-upgrade($k'$)$_{i'}$ event
must occur prior to the upgrade-ready($k_2$) event in $\alpha$ with no matching cfg-upg-ack($k'$)$_{i'}$ event
prior to the upgrade-ready($k_2$) event, where $k' = \ell\mathit{state}(\alpha''').\mathit{upg.target}_{i'}$. Otherwise, if there
were no ongoing upgrade operation, $i'$ would be idle. Let $t_1$ be the time at which this earlier
cfg-upgrade($k'$)$_{i'}$ operation occurs.

We can then apply the Upgrades-Complete Hypothesis, where $j = i'$ and $t_{upg} = t_1$; notice
that:

• $i' \in J(\max(t_1, \ell\mathit{time}(\alpha') + e + 2d))$: Lemma 7.3, applied where $t = t_1$ and $i = i'$, shows
that $i' \in J(\max(t_1, \ell\mathit{time}(\alpha') + e + 2d))$.

• $i'$ does not fail in $\beta(\max(t_1, \ell\mathit{time}(\alpha') + e + d) + 4d)$: By Part 2 of the way in which
$i'$ was chosen, $i'$ does not fail in $\beta(\max(t', \ell\mathit{time}(\alpha') + e + d) + 10d)$. Notice that
$t_1 < \max(t', \ell\mathit{time}(\alpha') + e + d)$, since the earlier upgrade event occurs in $\alpha'''$ prior to the
upgrade-ready($k_2$) event. Therefore $i'$ does not fail in $\beta(\max(t_1, \ell\mathit{time}(\alpha') + e + d) + 4d)$.

• $\max(t_1, \ell\mathit{time}(\alpha') + e + d) < t$: Again, notice that
$\max(t_1, \ell\mathit{time}(\alpha') + e + d) < \max(t', \ell\mathit{time}(\alpha') + e + d)$, since $t_1 < t'$. Also, $t' + 13d < t$, by
assumption, and $\ell\mathit{time}(\alpha') + e + 14d < t$, by assumption. Therefore,
$\max(t', \ell\mathit{time}(\alpha') + e + d) < t$, implying that $\max(t_1, \ell\mathit{time}(\alpha') + e + d) < t$.

We can then conclude that a cfg-upgrade-ack($k'$)$_{i'}$ occurs in $\alpha$ by time
$\max(t_1, \ell\mathit{time}(\alpha') + e + d) + 4d < \max(t', \ell\mathit{time}(\alpha') + e + d) + 4d$. If $k' > k_2$, then by the precondition
of the cfg-upgrade-ack($k'$) action, $i'$ marks $\mathit{cmap}(k_1) = \pm$, and we are done.

Otherwise, we apply Lemma 7.21 to show that another cfg-upgrade operation begins: let t_2
be the time at which the cfg-upgrade-ack(k')_{i'} occurs and α_2 be the prefix of α ending in the
cfg-upgrade-ack(k')_{i'} event. Notice that:

• t' < max(t_2, ℓtime(α') + e + d): By the way in which the cfg-upgrade(k') was chosen, it
has to complete no earlier than t'.

• max(t_2, ℓtime(α') + e + d) < max(t', ℓtime(α') + e + d) + 13d: Above, we showed
that cfg-upgrade-ack(k')_{i'} occurs by max(t', ℓtime(α') + e + d) + 4d, that is, t_2 <
max(t_1, ℓtime(α') + e + d) + 4d < max(t', ℓtime(α') + e + d) + 4d, since t_1 < t'. Therefore,
t_2 < max(t', ℓtime(α') + e + d) + 13d. Also, ℓtime(α') + e + d < ℓtime(α') + e + 14d.

Then we apply Lemma 7.21 with t = t', t' = max(t_2, ℓtime(α') + e + d), α'' = α_2, and i' as
chosen above:

• t' < max(t_2, ℓtime(α') + e + d) < max(t', ℓtime(α') + e + d) + 13d: as shown above,

• i' satisfies the criteria, by the properties of i' above,

• ℓtime(α_2) = max(t_2, ℓtime(α') + e + d) and upgrade-ready(k_2) occurs in α'': by the way
in which α_2 was chosen and the fact that the cfg-upgrade-ack(k')_{i'} must come after the
upgrade-ready(k_2) event,

• ℓstate(α_2).upg.phase_{i'} = idle: by the effect of the cfg-upg-ack(k')_{i'} event that is the last
event in α'''.

We then conclude that either:

1. ℓstate(β(max(t_2, ℓtime(α') + e + d))).cmap(k_1)_{i'} = ±, or

2. i' performs a cfg-upgrade(k'')_{i'} at time max(t_2, ℓtime(α') + e + d), for some k'' > k_2.

Again, if the first case holds, we are done: since t_2 < max(t', ℓtime(α') + e + d) + 8d,
Lemma 7.6 implies that ℓstate(β(max(t', ℓtime(α') + e + d) + 8d)).cmap(k_1)_{i'} = ±. Therefore,
we can assume that the second case holds, and i' performs a cfg-upgrade(k'')_{i'} at time
max(t_2, ℓtime(α') + e + d), for some k'' > k_2.

Once more, we apply the Upgrades-Complete Hypothesis, where j = i' and t_upg = t_2; notice
that:

• i' ∈ J(max(t_2, ℓtime(α') + e + 2d)): Recall that i' ∈ J(max(t_1, ℓtime(α') + e + 2d)),
above. Since max(t_1, ℓtime(α') + e + 2d) < max(t_2, ℓtime(α') + e + 2d) (i.e., the upgrade
begins before it completes), by Lemma 7.1, where t = max(t_1, ℓtime(α') + e + 2d) and t' =
max(t_2, ℓtime(α') + e + 2d), J(max(t_1, ℓtime(α') + e + 2d)) ⊆ J(max(t_2, ℓtime(α') + e + 2d)),
implying that i' ∈ J(max(t_2, ℓtime(α') + e + 2d)).

• i' does not fail in β(max(t_2, ℓtime(α') + e + d) + 4d): By Part 2 of the way in which
i' was chosen, i' does not fail in β(max(t', ℓtime(α') + e + d) + 10d). Notice that t_2 <
max(t', ℓtime(α') + e + d) + 4d, as shown above. Therefore max(t_2, ℓtime(α') + e + d) + 4d <
max(t', ℓtime(α') + e + d) + 8d, and as a result i' does not fail in β(max(t_2, ℓtime(α') +
e + d) + 4d).

• max(t_2, ℓtime(α') + e + d) < t: Again, notice that max(t_2, ℓtime(α') + e + d) <
max(t', ℓtime(α') + e + d) + 4d. Also, t' + 13d < t, by assumption, and ℓtime(α') + e + d + 13d < t,
by assumption. Therefore, max(t', ℓtime(α') + e + d) + 13d < t. Therefore, max(t_2, ℓtime(α') + e +
d) < max(t', ℓtime(α') + e + d) + 4d < t - 9d, as desired.

We can then conclude that a cfg-upgrade-ack(k'')_{i'} occurs in α by time max(t_2, ℓtime(α') +
e + d) + 4d < max(t', ℓtime(α') + e + d) + 8d. Since k'' > k_2, then by the precondition
of the cfg-upgrade-ack(k'') action, i' marks cmap(k_1) = ±, and Lemma 7.6 implies that
ℓstate(β(max(t', ℓtime(α') + e + d) + 8d)).cmap(k_1)_{i'} = ±.

□

In the next lemma, we provide a conditional guarantee that a configuration remains viable. 

Lemma 7.23 Let α be an α'-normal execution satisfying (i) (α', e)-join-connectivity, (ii) (α', e)-
recon-readiness, (iii) (α', e)-upgrade-readiness, (iv) (α', 2d)-recon-spacing-1, (v) (α', e, 22d)-
configuration-viability.

Assume that t ∈ ℝ≥0 is a time such that t > ℓtime(α') + e + 14d. Assume that c_1 is
a configuration, and for some finite prefix α'' of α, where t = ℓtime(α''), for some node i ∈
J(max(t, ℓtime(α') + e + 2d)) that does not fail in α'', for some index k_1, ℓstate(α'').cmap(k_1)_i = c_1.

Also we assume the Upgrades-Complete Hypothesis: for all cfg-upgrade(*)_j events that occur in
α at some time t_upg < t at some node j ∈ J(max(t_upg, ℓtime(α') + e + 2d)) where j does not fail in
β(max(t_upg, ℓtime(α') + e + d) + 4d), a matching cfg-upg-ack(*)_j occurs by time max(t_upg, ℓtime(α') +
e + d) + 4d.

Then there exists a read-quorum, R ∈ read-quorums(c_1), and a write-quorum, W ∈ write-quorums(c_1),
such that no node in R ∪ W fails in β(t + 3d).

Proof. Let k_2 = k_1 + 1, and let c_2 = c(k_2). First, consider the case where no upgrade-ready(k_2)
event occurs in α. We apply Lemma 7.18, where c = c_1 and k = k_1; this implies, then, that there
exists a read-quorum, R ∈ read-quorums(c_1), and a write-quorum, W ∈ write-quorums(c_1), such
that no node in R ∪ W fails in α.

Next, consider the case where an upgrade-ready(k_2) event occurs in α. Let t' be the time at
which the upgrade-ready(k_2) event occurs. We claim that upgrade-ready(k_2) occurs no earlier than
t - 13d. That is, t' + 13d ≥ t.

Assume, for contradiction, that t' + 13d < t. We now apply Lemma 7.22 to conclude that there
exists a node i' ∈ J(max(t', ℓtime(α') + e + d) + 8d) that does not fail in β(max(t', ℓtime(α') + e +
d) + 10d) such that ℓstate(β(max(t', ℓtime(α') + e + d) + 8d)).cmap(k_1)_{i'} = ±.

We now show that the information about configuration c_1's removal is propagated from node i'
to node i. That is, we show the following:
Claim: ℓstate(α'').cmap(k_1)_i = ±.

Proof of claim: We do this in three steps. First, we show that ℓstate(β(max(t', ℓtime(α') +
e + d) + 8d)).cmap_{i'} is mainstream after max(t', ℓtime(α') + e + d) + 10d. Second, we show that
ℓstate(β(max(t', ℓtime(α') + e + d) + 8d)).cmap_{i'} is mainstream after t - d. Third, we conclude that
ℓstate(α'').cmap(k_1)_i = ±.

Step 1: We already know that i' ∈ J(max(t', ℓtime(α') + e + d) + 8d), and does not fail in
β(max(t', ℓtime(α') + e + d) + 10d). We then apply Lemma 7.7, where t = max(t', ℓtime(α') + e +
d) + 8d, and i = i':

• max(t', ℓtime(α') + e + d) + 8d > ℓtime(α') + e: Immediate.

• i' ∈ J(max(t', ℓtime(α') + e + d) + 8d + 2d): i' ∈ J(max(t', ℓtime(α') + e + d) + 8d), as shown
above, therefore this follows from Lemma 7.1, where t = max(t', ℓtime(α') + e + d) + 8d and
t' = max(t', ℓtime(α') + e + d) + 10d.

• i' does not fail in β(max(t', ℓtime(α') + e + d) + 8d + d), since i' does not fail in
β(max(t', ℓtime(α') + e + d) + 8d + 2d) as shown above.

Therefore we can conclude that ℓstate(β(max(t', ℓtime(α') + e + d) + 8d)).cmap_{i'} is mainstream
after max(t', ℓtime(α') + e + d) + 10d.

Step 2: We have assumed above that t' < t - 13d, that is, t' + 10d < t - d - 2d. Also,
we have assumed that ℓtime(α') + e + 14d < t, that is, ℓtime(α') + e + d + 10d < t - d - 2d.
Therefore, max(t', ℓtime(α') + e + d) + 10d < t - 3d. We now apply Lemma 7.14, where t =
max(t', ℓtime(α') + e + d) + 10d, t' = t - d, and cm = ℓstate(β(max(t', ℓtime(α') + e + d) + 8d)).cmap_{i'}:

• e + 2d < max(t', ℓtime(α') + e + d) + 10d,

• max(t', ℓtime(α') + e + d) + 10d < t - 3d,

• ℓstate(β(max(t', ℓtime(α') + e + d) + 8d)).cmap_{i'} is mainstream after max(t', ℓtime(α') + e +
d) + 10d.

We therefore conclude that ℓstate(β(max(t', ℓtime(α') + e + d) + 8d)).cmap_{i'} is mainstream after
t - d.

Step 3: Notice, then, that by assumption i ∈ J(t) and i does not fail in β(t - d). Therefore
by the definition of mainstream, ℓstate(β(max(t', ℓtime(α') + e + d) + 8d)).cmap_{i'} ≤ ℓstate(β(t -
d)).cmap_i. Lemma 7.6 then implies that ℓstate(β(t - d)).cmap_i ≤ ℓstate(α'').cmap_i, since β(t - d)
is a prefix of α''. Therefore, ℓstate(β(max(t', ℓtime(α') + e + d) + 8d)).cmap_{i'} ≤ ℓstate(α'').cmap_i.
Since ℓstate(β(max(t', ℓtime(α') + e + d) + 8d)).cmap(k_1)_{i'} = ± (as shown above), this means that
ℓstate(α'').cmap(k_1)_i = ±, as claimed above, concluding Step 3.
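The ordering argument in Step 3 can be illustrated with a small Python sketch. It assumes, as in the Rambo papers, that each cmap entry is ⊥ (not yet known), a configuration, or ± (removed), ordered ⊥ < c < ±, and that cmaps are compared pointwise. The names BOT, PM, rank, entry_leq, cmap_leq and update are hypothetical and chosen only for illustration; this is not the authors' implementation.

```python
BOT = "bot"  # stands for ⊥: configuration at this index is not yet known
PM = "pm"    # stands for ±: configuration at this index has been removed


def rank(entry):
    """Position of a cmap entry in the assumed order ⊥ < c < ±."""
    if entry == BOT:
        return 0
    if entry == PM:
        return 2
    return 1  # any actual configuration


def entry_leq(a, b):
    """a ≤ b holds for equal entries or when a is strictly lower in the order."""
    return a == b or rank(a) < rank(b)


def cmap_leq(cm1, cm2):
    """Pointwise comparison of two cmaps; missing indices are treated as ⊥."""
    keys = set(cm1) | set(cm2)
    return all(entry_leq(cm1.get(k, BOT), cm2.get(k, BOT)) for k in keys)


def update(cm, incoming):
    """Merge gossiped information: keep the larger entry at each index."""
    merged = dict(cm)
    for k, v in incoming.items():
        if entry_leq(merged.get(k, BOT), v):
            merged[k] = v
    return merged


# Node i' has marked index 1 as removed; node i still holds c1 there.
cm_i_prime = {0: PM, 1: PM, 2: "c2"}
cm_i = {0: PM, 1: "c1", 2: "c2"}
cm_i_after = update(cm_i, cm_i_prime)
```

Under these assumptions, once the cmap of i' is below the cmap of i in the pointwise order, every index that i' has marked ± must also be ± at i, which is exactly the step the proof relies on.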

This claim that ℓstate(α'').cmap(k_1)_i = ±, though, leads to a contradiction: by assumption of
this lemma, ℓstate(α'').cmap(k_1)_i = c_1. Therefore, we conclude that our assumption that t' < t - 13d
is incorrect: that is, we must have t' ≥ t - 13d. That is, we have shown that the upgrade-ready(k_2)
event occurs at most 13d prior to time t.

We now apply Lemma 7.17, where c = c_2, k = k_2, and t = t', to conclude that there exists a
read-quorum, R, and a write-quorum, W, of configuration c_1 such that no node in R ∪ W fails in
β(max(t', ℓtime(α') + e) + 16d). Above we showed that t' + 13d ≥ t, therefore t' + 16d ≥ t + 3d,
which implies that max(t', ℓtime(α') + e) + 16d ≥ t + 3d. Therefore, we can conclude that there
exists a read-quorum, R, and a write-quorum, W, of configuration c_1 such that no node in R ∪ W
fails in β(t + 3d). □
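For reference, the chain of inequalities behind the last step can be written out explicitly. This is only a restatement of the argument above; it uses nothing beyond the bound on t' just established and d ≥ 0.

```latex
\begin{align*}
t' + 13d \ge t
  &\;\Longrightarrow\; t' + 16d \ge t + 3d \\
  &\;\Longrightarrow\; \max\bigl(t',\ \ell\mathit{time}(\alpha') + e\bigr) + 16d
      \;\ge\; t' + 16d \;\ge\; t + 3d .
\end{align*}
```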

The next two lemmas claim that every configuration-upgrade operation completes soon after it 
begins, or soon after the network stabilizes. The first lemma handles the case where the upgrade 
begins before the network stabilizes, or during stabilization. The second lemma handles the general 
case, for all t. 

Lemma 7.24 Let α be an α'-normal execution satisfying: (i) (α', e)-join-connectivity, (ii) (α', e)-
recon-readiness, (iii) (α', 2d)-recon-spacing-1, (iv) (α', e, 22d)-configuration-viability.

Assume that t ∈ ℝ≥0 is a time such that t < ℓtime(α') + e + 14d, and that a cfg-upgrade(k)_i
occurs at time t at node i. Assume that node i ∈ J(t) and that i does not fail in β(max(t, ℓtime(α') +
d) + 4d).

Then a cfg-upg-ack(k)_i occurs no later than time max(t, ℓtime(α') + d) + 4d.

Proof. Let γ be the configuration-upgrade operation associated with the cfg-upgrade(k) action.
Lemma 7.19 shows that proving the following is sufficient to prove the lemma: for every
configuration in removal-set(γ) there exists a read-quorum, R, and a write-quorum, W, such that
no node in R ∪ W fails by time max(t, ℓtime(α') + d) + 3d.

Consider any configuration c_1 with index k_1 in removal-set(γ). If t_1 is the time at which
configuration c(k_1 + 1) is installed, configuration-viability ensures that configuration c_1 does not
fail until max(t_1, ℓtime(α') + e) + 22d. Notice that ℓtime(α') + e + 22d > t + 3d, since t <
ℓtime(α') + e + 14d. Therefore, this guarantees that there exists a read-quorum, R, and a
write-quorum, W, for configuration c_1 such that no node in R ∪ W fails until after ℓtime(α') + e + 22d >
max(t, ℓtime(α') + d) + 3d. □
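The final inequality can be sanity-checked numerically. The sketch below is only an illustration and not part of the proof: it writes L for ℓtime(α'), restricts all quantities to small integers with d ≥ 1, and uses hypothetical function names.

```python
from itertools import product


def viability_horizon(L, e, d):
    """Time until which configuration-viability keeps some quorums alive."""
    return L + e + 22 * d


def upgrade_deadline(t, L, d):
    """Time until which quorums are needed in the proof of Lemma 7.24."""
    return max(t, L + d) + 3 * d


def check_lemma_7_24_bound(limit=6):
    """Exhaustively check the bound for every t < L + e + 14d on a small grid."""
    for L, e, d in product(range(limit), range(limit), range(1, limit)):
        for t in range(L + e + 14 * d):
            if viability_horizon(L, e, d) <= upgrade_deadline(t, L, d):
                return False
    return True
```

Calling check_lemma_7_24_bound() searches the grid for a counterexample; the slack in the bound comes from the gap between 22d and 14d + 3d.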

Lemma 7.25 Let α be an α'-normal execution satisfying: (i) (α', e)-join-connectivity, (ii) (α', e)-
recon-readiness, (iii) (α', 2d)-recon-spacing-1, (iv) (α', e, 22d)-configuration-viability.

Assume that t ∈ ℝ≥0 is a time, and that a cfg-upgrade(k)_i occurs in α at time t at node i.
Assume that node i ∈ J(t) and that i does not fail in β(max(t, ℓtime(α') + e + d) + 4d).

Then a cfg-upg-ack(k)_i occurs no later than time max(t, ℓtime(α') + e + d) + 4d.

Proof. We prove this lemma by proving a stronger statement by strong induction on the number
of cfg-upgrade events in α: if a cfg-upgrade(*)_j event occurs in α at some time t_upg < t at some
node j ∈ J(t_upg), and j does not fail in β(max(t_upg, ℓtime(α') + e + d) + 4d), then a matching
cfg-upg-ack(*)_j occurs no later than time max(t_upg, ℓtime(α') + e + d) + 4d.

As this is strong induction, we now examine the inductive step. Consider configuration-upgrade
γ, the (k + 1)st upgrade operation in α, which occurs at time t_upg < t at node j ∈ J(t_upg) that does
not fail in β(max(t_upg, ℓtime(α') + e + d) + 4d). Assume, inductively, that if γ' is one of the
first k upgrade operations that occurs at time t' < t at some node j' ∈ J(t') that does not fail
in β(max(t', ℓtime(α') + e + d) + 4d), then a matching cfg-upg-ack(*) occurs no later than time
max(t', ℓtime(α') + e + d) + 4d. There are two cases to consider.

Case 1: t_upg < ℓtime(α') + e + 14d.

Recall that the cfg-upgrade event occurs at node j ∈ J(t_upg) where j does not fail in
β(max(t_upg, ℓtime(α') + e + d) + 4d). Lemma 7.24 shows that a cfg-upg-ack(k)_j occurs by time
max(t_upg, ℓtime(α') + d) + 4d < max(t_upg, ℓtime(α') + e + d) + 4d.

Case 2: t_upg > ℓtime(α') + e + 14d.

Lemma 7.19 shows that proving the following is sufficient to prove the lemma: for every
configuration in removal-set(γ) there exists a read-quorum, R, and a write-quorum, W, such
that no node in R ∪ W fails in β(max(t_upg, ℓtime(α') + d) + 3d). Let α'' be the prefix of α
ending with the cfg-upgrade event γ. Fix some configuration c ∈ removal-set(γ) with index
k; that is, ℓstate(α'').cmap(k)_j = c. We now apply Lemma 7.23, where c_1 = c, k_1 = k, α'' is
as just defined, and t = t_upg:

• t_upg > ℓtime(α') + e + 14d.

• t_upg = ℓtime(α'').

• ℓstate(α'').cmap(k)_j = c, since c ∈ removal-set(γ) and α'' is the execution ending with
the event γ.

• j ∈ J(max(t_upg, ℓtime(α') + e + 2d)), since j ∈ J(t_upg) and t_upg > ℓtime(α') + e + 14d.

• Upgrades-Complete Hypothesis: for every cfg-upgrade(*)_{j'} event that occurs in α at
some time t' < t_upg at some node j' ∈ J(max(t_upg, ℓtime(α') + e + 2d)) where j' does
not fail in β(max(t_upg, ℓtime(α') + e + d) + 4d), a matching cfg-upg-ack(*)_{j'} occurs by time
max(t_upg, ℓtime(α') + e + d) + 4d: this is the inductive hypothesis, since any cfg-upgrade
occurring at time t' < t_upg must be one of the first k upgrade events.

Therefore, we conclude that there exists a read-quorum, R ∈ read-quorums(c), and a
write-quorum, W ∈ write-quorums(c), such that no node in R ∪ W fails in β(t + 3d). Since this is
true for all c ∈ removal-set(γ), this then shows the desired result.

□ 

We next present two corollaries that follow from these lemmas. First, we present the unconditional 
version of Lemma 7.23: 

Corollary 7.26 Let α be an α'-normal execution satisfying (i) (α', e)-join-connectivity, (ii) (α', e)-
recon-readiness, (iii) (α', 2d)-recon-spacing-1, (iv) (α', e, 22d)-configuration-viability.

Assume that t ∈ ℝ≥0 is a time. Assume that c is a configuration, and for some finite prefix
α'' of α where t = ℓtime(α''), some node i ∈ J(t) that does not fail in α'', for some index k,
ℓstate(α'').cmap(k)_i = c.

Then there exists a read-quorum, R, and a write-quorum, W, such that no node in R ∪ W fails
in β(max(t, ℓtime(α') + e + d) + 3d).

Proof. If t > ℓtime(α') + e + 14d, then we show that the result follows from Lemma 7.25 and
Lemma 7.23. We apply Lemma 7.23 where c_1 = c, k_1 = k: notice that Lemma 7.23 assumes that:

• t > ℓtime(α') + e + 14d: By assumption.

• t = ℓtime(α''): By assumption.

• ℓstate(α'').cmap(k)_i = c: By assumption.

• i ∈ J(max(t, ℓtime(α') + e + 2d)): t > ℓtime(α') + e + 14d.

• i does not fail in α'': By assumption.

• Upgrades-Complete Hypothesis: Fix some cfg-upgrade(*)_j event that occurs at time t_upg < t
at node j ∈ J(max(t_upg, ℓtime(α') + e + 2d)) where j does not fail in β(max(t_upg, ℓtime(α') +
e + d) + 4d). We apply Lemma 7.25, where t = t_upg and i = j. (Notice that j ∈ J(t_upg) by
Lemma 7.1.) We therefore conclude that a cfg-upg-ack(*)_j occurs no later than max(t_upg, ℓtime(α') +
e + d) + 4d, as required by the conclusion of the Upgrades-Complete Hypothesis.

We thus conclude that there exists a read-quorum, R ∈ read-quorums(c), and a write-quorum,
W ∈ write-quorums(c), such that no node in R ∪ W fails in β(t + 3d). Since t > ℓtime(α') + e + 14d,
this implies that no node in R ∪ W fails in β(max(t, ℓtime(α') + e + d) + 3d).

Alternatively, if t < ℓtime(α') + e + 14d, configuration-viability guarantees that there exists a
read-quorum, R, and a write-quorum, W, such that no node in R ∪ W fails in β(ℓtime(α') + e + 22d),
and again the result follows. □

The second corollary guarantees the liveness of the system; that is, the following corollary shows 
that read and write operations always terminate eventually: 

Corollary 7.27 Let α be an α'-normal execution satisfying (i) (α', e)-join-connectivity, (ii) (α', e)-
recon-readiness, (iii) (α', 2d)-recon-spacing-1, (iv) (α', e, 22d)-configuration-viability.

Assume that t ∈ ℝ≥0. Assume that at time t, for some i ∈ J(t) that does not fail in α⁷, a read_i
or write_i occurs in α. Then the operation eventually completes.

Proof. The read or write operation completes if each phase of the operation completes. Let ψ be
the read_i, write_i, query-fix_i, or recv_i action that sets op.cmap to cmap, beginning the phase. Each
phase completes when for all ℓ : op.cmap(ℓ)_i ∈ C, i has sent a gossip message to an appropriate
quorum of nodes in c(ℓ), and received a response. The only way an operation can fail to terminate,
then, is if there does not exist a non-failed read-quorum or a write-quorum of some configuration
in op.cmap.

Assume that c is a configuration with index k such that op.cmap(k)_i is set to c at some
time t' after ψ, and before the phase completes. Then for some α'' where t' = ℓtime(α''),
ℓstate(α'').cmap(k)_i = c, since op.cmap is set by copying a truncated version of cmap_i. By
Corollary 7.26, there exists a read-quorum, R, and a write-quorum, W, such that no node in R ∪ W fails
in β(max(t, ℓtime(α') + e + d) + 3d). No later than time max(t, ℓtime(α') + e + d) + d, node i sends a
gossip message to every node in R ∪ W. By time max(t, ℓtime(α') + e + d) + 2d the message is received
by every node in R ∪ W, and each node sends a response to i. By time max(t, ℓtime(α') + e + d) + 3d,
node i receives the response, and R ∪ W ⊆ acc. Therefore, for all configurations the read and write
quorums survive long enough, and so the phase completes. □
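The timing argument in this proof can be sketched as a short timeline in Python. This is an illustrative model only: it assumes every step takes at most d, writes T for max(t, ℓtime(α') + e + d), and the names gossip_round_deadlines, phase_completes and fail_time are hypothetical.

```python
def gossip_round_deadlines(T, d):
    """Deadlines used in the proof: send by T+d, deliver by T+2d, reply by T+3d."""
    return {"sent": T + d, "delivered": T + 2 * d, "replied": T + 3 * d}


def phase_completes(T, d, quorum, fail_time):
    """Check that no quorum member fails before the reply deadline T+3d.

    fail_time maps a node to the time at which it fails; nodes that are
    absent from the map are assumed never to fail.
    """
    deadline = gossip_round_deadlines(T, d)["replied"]
    return all(fail_time.get(node, float("inf")) > deadline for node in quorum)
```

For example, with T = 10 and d = 2 the reply deadline is 16, so a quorum member failing at time 17 does not block the phase, while one failing at time 15 does in this simplified model.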



7 More specifically, we are assuming that i does not fail until after the operation terminates; since we do not here
bound how long the operation may take, we instead assume that i does not fail in α. Obviously i failing after the
operation completes has no effect on the operation completing.

7.8 Read-Write Latency Results

In this section we state and prove the main result of the latency analysis: if an execution contains
a period of good behavior, and if this section of the execution is 22d-configuration-viable,
then all read and write operations terminate, and terminate within 8d. Notice that in the original
Rambo paper, a similar result required the stronger assumption of ∞-configuration-viability, an
arbitrarily unbounded failure assumption, depending on events earlier in the execution. Here there
is no dependency on earlier events: the algorithm is guaranteed to stabilize rapidly, as soon as the
network stabilizes.

We need one more lemma. This lemma shows that once a report(c) action occurs for some
configuration with index k, then soon every node has set cmap(ℓ) ≠ ⊥, for all ℓ ≤ k. This will
allow us to show that if a read or write operation begins long enough after a certain report(c)
operation, then it cannot be interrupted by learning about new configurations with smaller indices.

Lemma 7.28 Let α be an α'-normal execution satisfying: (i) (α', e)-join-connectivity, (ii) (α', e)-
recon-readiness, (iii) 6d-recon-spacing, (iv) (α', e, 4d)-configuration-viability.

Assume that α contains decide events for infinitely many configurations. Let ℓ be a configuration
index. Let c_1 be the configuration with index ℓ, and c_2 be the configuration with index ℓ + 1.

Let i be the node at which the first recon(c_1, c_2) event, π, occurs. Let t be the time at which the
report(c_1)_i event, φ, occurs.

Then there exists a CMap, cm, such that:

1. cm(ℓ) ≠ ⊥, and

2. cm is mainstream after max(t, ℓtime(α') + e + d) + 6d.

Proof. There are two cases to consider. In each case, we first demonstrate an appropriate cm:
we identify a node that performs a report(c_1) and does not fail too soon. We then show that the
cmap of that node is mainstream after max(t, ℓtime(α') + e + d) + 6d.

Case 1: recon(c_1, c_2)_i occurs at some time < ℓtime(α') + e + 2d.

In this case, we use the Recon-Spacing-2 assumption to identify a node with an appropriate
cmap, and then use configuration-viability to show that this node survives long enough for
its cmap to become mainstream after ℓtime(α') + e + 4d, which then allows us to show that
its cmap is mainstream after max(t, ℓtime(α') + e + d) + 6d.

By the Recon-Spacing-2 assumption, there exists a write-quorum, W ∈ write-quorums(c_1),
such that for every node j ∈ W, a report(c_1)_j occurs in α prior to π, the recon event that
proposes configuration c_2. By configuration-viability, there exists some node j ∈ W that does
not fail by time ℓtime(α') + e + 4d, since there exists some read-quorum, R, that does not fail
by time ℓtime(α') + e + 4d, and by assumption R ∩ W ≠ ∅.

We now show that cmap_j satisfies Property 1 after ℓtime(α') + e + 2d. Notice that:

ℓstate(β(time(π))).cmap(ℓ)_j ≠ ⊥,

since the report action notifies j of the configuration c_1 prior to π. By assumption we know that
time(π) < ℓtime(α') + e + 2d. Therefore we know that ℓstate(β(ℓtime(α') + e + 2d)).cmap(ℓ)_j ≠ ⊥.

Let cm = ℓstate(β(ℓtime(α') + e + 2d)).cmap_j. We know, then, that cm(ℓ) ≠ ⊥, as desired.

Next we show that cm is mainstream after ℓtime(α') + e + 4d. We apply Lemma 7.7, where
i = j, t = ℓtime(α') + e + 2d:

• j ∈ J(ℓtime(α') + e + 4d): If ℓ = 0, then j = i_0 and we have, by assumption, that i_0
performs a join-ack_j at time 0, immediately implying the statement by the definition of
J.

Otherwise, we apply Lemma 7.2, where h = c_1, t' = time(recon(c(ℓ - 1), c_1)), and
t = ℓtime(α') + e + 2d. Notice that ℓtime(α') + e + 2d > time(recon(c(ℓ - 1), c_1)) since
ℓtime(α') + e + 2d > time(π), and time(π) > time(recon(c(ℓ - 1), c_1)). We therefore
conclude that members(c_1) ⊆ J(ℓtime(α') + e + 2d). In particular, this means that
j ∈ J(ℓtime(α') + e + 2d). Next we apply Lemma 7.1, where t = ℓtime(α') + e + 2d and
t' = ℓtime(α') + e + 4d to see that j ∈ J(ℓtime(α') + e + 4d).

• ℓtime(α') + e + 2d > ℓtime(α') + e: Immediate.

• j does not fail in β(ℓtime(α') + e + 3d): as shown above j does not fail in β(ℓtime(α') +
e + 4d), by choice of j and configuration-viability.

We then conclude, since cm = ℓstate(β(ℓtime(α') + e + 2d)).cmap_j, that cm is mainstream
after ℓtime(α') + e + 4d.

We next apply Lemma 7.14, where t = ℓtime(α') + e + 4d, t' = max(t, ℓtime(α') + e + d) + 6d,
and cm is as defined above:

• e + 2d < ℓtime(α') + e + 4d: Immediate.

• ℓtime(α') + e + 4d < max(t, ℓtime(α') + e + d) + 6d - 2d: Immediate.

• cm is mainstream after ℓtime(α') + e + 4d: As shown above.

Therefore, we conclude that cm is mainstream after max(t, ℓtime(α') + e + d) + 6d, as desired.

Case 2: recon(c_1, c_2)_i occurs at some time > ℓtime(α') + e + 2d.

We first notice that i has been notified of configuration c_1 and then show that the cmap of i
is mainstream after max(t, ℓtime(α') + e + d) + 6d.

Notice that ℓstate(β(t)).cmap(ℓ)_i ≠ ⊥, since the report(c_1)_i event notifies i of configuration
c_1.

We now apply Lemma 7.7, where i is as defined above and t = max(t, ℓtime(α') + e + d), to
show that cm is mainstream after max(t, ℓtime(α') + e + d) + 2d:

• max(t, ℓtime(α') + e + d) + 2d > ℓtime(α') + e: Immediate.

• i ∈ J(max(t, ℓtime(α') + e + d) + 2d): We apply Lemma 7.2, where h = c_1, t' is the
time at which c_1 is proposed, and t = max(t, ℓtime(α') + e + d) + 2d. Notice that
max(t, ℓtime(α') + e + d) + 2d is no earlier than the time at which c_1 is proposed, since a
report(c_1) occurs prior to max(t, ℓtime(α') + e + d) + 2d. Also, max(t, ℓtime(α') + e + d) +
2d > ℓtime(α') + e + 2d. Therefore we conclude that members(c_1) ⊆ J(max(t, ℓtime(α') +
e + d) + 2d). This implies that i ∈ J(max(t, ℓtime(α') + e + d) + 2d).

• i does not fail in β(max(t, ℓtime(α') + e + d) + d): We know that i does not fail prior to
event π, that is, the recon(c_1, c_2)_i event. By Recon-Spacing-1, we know that time(π) >
t + 6d. By assumption of this case, we know that time(π) > ℓtime(α') + e + 2d. Therefore
i does not fail in β(max(t, ℓtime(α') + e + d) + d).

We therefore conclude that cm is mainstream after max(t, ℓtime(α') + e + d) + 2d.

We next apply Lemma 7.14, where t = max(t, ℓtime(α') + e + d) + 2d, t' = max(t, ℓtime(α') +
e + d) + 6d, and cm is as defined above:

• e + 2d < max(t, ℓtime(α') + e + d) + 2d: Immediate.

• max(t, ℓtime(α') + e + d) + 2d < max(t, ℓtime(α') + e + d) + 6d - 2d: Immediate.

• cm is mainstream after max(t, ℓtime(α') + e + d) + 2d: As shown above.

Therefore, we conclude that cm is mainstream after max(t, ℓtime(α') + e + d) + 6d, as desired.

□ 

We finally prove the main theorem, showing that read and write operations terminate rapidly.
This result requires 12d+e-recon-spacing, and is similar to Theorem 8.17 from [13]. This earlier
theorem states that in a normal, steady-state case, with good environmental behavior, read and
write operations terminate within time 8d. Although the following theorem allows for more
complicated behavior, deviating from the assumption of good environmental behavior, read and write
operations still complete rapidly.

Theorem 7.29 Let α be an α'-normal execution satisfying: (i) (α', e)-join-connectivity, (ii) (α', e)-
recon-readiness, (iii) 12d+e-recon-spacing, (iv) (α', e, 22d)-configuration-viability.

Let t > ℓtime(α') + e + 17d, and assume a read or write operation starts at time t at some node
i. Assume i ∈ J(t + 8d), and does not fail until the read or write operation completes. Also, assume
that α contains decide events for infinitely many configurations. Then node i completes the read or
write operation by time t + 8d.

Proof. Let c_0, c_1, c_2, ... denote the infinite sequence of successive configurations decided upon in
α; by infinite reconfiguration, this sequence exists. For each k ≥ 0, let π_k be the first recon(c_k, c_{k+1})_*
event in α, let i_k be the location at which this occurs, and let φ_k be the corresponding,
preceding report(c_k)_{i_k} event. (The special case of this notation for k = 0 is consistent with our usage
elsewhere.)

We show that the time for each phase of the read or write operation is ≤ 4d; this will yield the
bound we need. Consider one of the two phases, and let ψ be the read_i, write_i or query-fix_i event
that begins the phase.

We claim that time(ψ) ≥ time(φ_0) + 8d, that is, that ψ occurs at least 8d time after the
report(0)_{i_0} event: We have that time(ψ) ≥ t, and t ≥ time(join-ack_i) + 8d by assumption that
i ∈ J(t + 8d). Also, time(join-ack_i) ≥ time(join-ack_{i_0}). Furthermore, time(join-ack_{i_0}) ≥ time(φ_0),
that is, when join-ack_{i_0} occurs, report(0)_{i_0} occurs with no time passage. Putting these inequalities
together we see that time(ψ) ≥ time(φ_0) + 8d.

Fix k to be the largest number such that time(ψ) ≥ time(φ_k) + 8d. The claim in the preceding
paragraph shows that such k exists.

Next, we show that within zero time of ψ occurring, cmap(ℓ)_i ≠ ⊥ for all ℓ ≤ k. It is at this
point that the proof diverges from that of Lemma 8.17 from [12].

For the purposes of the next two lemmas, fix any ℓ ≤ k. We apply Lemma 7.28, where ℓ is as
fixed above, t = time(φ_ℓ), φ = φ_ℓ, π = π_ℓ, c_1 = c_ℓ, and i = i_ℓ. We therefore conclude that there
exists a CMap cm such that:

1. cm(ℓ) ≠ ⊥, and

2. cm is mainstream after max(time(φ_ℓ), ℓtime(α') + e + d) + 6d.

We next apply Lemma 7.14, where t = max(time(φ_ℓ), ℓtime(α') + e + d) + 6d, t' = time(ψ), and
cm is as above, to show that cm is mainstream after time(ψ):

• e + 2d < max(time(φ_ℓ), ℓtime(α') + e + d) + 6d: Immediate.

• max(time(φ_ℓ), ℓtime(α') + e + d) + 6d ≤ time(ψ) - 2d: By the way in which k is chosen we
know that time(φ_k) + 8d ≤ time(ψ). Also, time(φ_ℓ) ≤ time(φ_k): either ℓ = k, or φ_ℓ precedes
π_ℓ which precedes φ_k. By assumption we know that ℓtime(α') + e + 8d < t, and t ≤ time(ψ).

• cm is mainstream after max(time(φ_ℓ), ℓtime(α') + e + d) + 6d: As shown above.

Therefore, we conclude that cm is mainstream after time(ψ). We know that i ∈ J(t), and t ≤
time(ψ), so by Lemma 7.1, i ∈ J(time(ψ)). Also, i does not fail until the read or write operation
completes, and therefore either the read or write operation completes at time(ψ) (in which case we
have proved the desired bound) or i does not fail in β(time(ψ)). Therefore by definition of a CMap
being mainstream, if cm is mainstream after time(ψ), then cm ≤ ℓstate(β(time(ψ))).cmap_i.

Having shown this for fixed ℓ ≤ k, we now know that for all ℓ ≤ k there exists some CMap,
cm, such that cm(ℓ) ≠ ⊥ and cm is mainstream after time(ψ); this implies that for all ℓ ≤ k,
ℓstate(β(time(ψ))).cmap(ℓ)_i ≠ ⊥. Therefore we have shown that within zero time of ψ occurring,
cmap(ℓ)_i ≠ ⊥ for all ℓ ≤ k.

Now, by choice of k, we know that time(ψ) < time(φ_{k+1}) + 8d. The Recon-Spacing condition
implies that time(π_{k+1}) (the first recon event that requests the creation of the (k + 2)nd
configuration) is > time(φ_{k+1}) + 12d. Therefore, for an interval of time of length > 4d after ψ, the largest
index of any configuration that appears anywhere in the system is k + 1. This implies that the
phase of the read or write operation that starts with ψ completes with at most one additional delay
(of 2d) for learning about a new configuration. This yields a total time of at most 4d for the phase,
as claimed.

Finally, by Corollary 7.27, the operation eventually terminates, which guarantees that every
configuration in op.cmap remains viable for long enough. □
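The bookkeeping in this proof can be sketched in Python. The sketch is illustrative only and rests on stated assumptions: report_times[k] stands for time(φ_k), the boundary case in the choice of k is treated as inclusive, and a phase is modelled as one round trip of 2d plus at most one further 2d delay for learning about a new configuration. All function names are hypothetical.

```python
def choose_k(report_times, psi_time, d):
    """Largest k with psi_time >= report_times[k] + 8d, or None if none exists.

    report_times[k] stands for time(phi_k) and is assumed nondecreasing in k.
    """
    k = None
    for idx, phi_time in enumerate(report_times):
        if psi_time >= phi_time + 8 * d:
            k = idx
    return k


def phase_bound(d, new_config_learned):
    """One round trip (2d), plus one more 2d if a new configuration is learned."""
    return 2 * d + (2 * d if new_config_learned else 0)


def operation_bound(d):
    """Worst case over the two phases of a read or write operation."""
    return 2 * phase_bound(d, new_config_learned=True)
```

With made-up report times [0, 5, 30, 100], d = 1, and a phase starting at time 40, choose_k selects index 2, and operation_bound(d) evaluates to 8d, matching the bound stated in the theorem.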

This shows that assuming (α', e, 22d)-configuration-viability is sufficient to guarantee that read
and write operations terminate quickly. As long as the reconfiguration algorithm can guarantee
this level of viability, the Rambo II algorithm will continue to make progress, regardless of any bad
behavior the network may experience. Further, while 22d may seem a long period of time to ensure
viability, it must be remembered that d is typically a small interval: we have been assuming that
d is a single message delay in the network. Note that simply deciding on a new configuration to
install might take many intervals of d (in [12], it is bounded by 11d). Also, this 22d bound is fairly
conservative: by making stronger assumptions as to who begins configuration-upgrade operations,
and how gossip messages propagate information about completed configuration-upgrade operations,
it is probably possible to improve this bound. In this paper we are primarily interested in the fact
that it is a constant time bound.

8 Implementation and Preliminary Evaluation 

Musial and Shvartsman [16] have developed a prototype distributed implementation that
incorporates both the original Rambo configuration management algorithm [12] and the new Rambo
II algorithm presented in this paper. The system was developed by manually translating the
Input/Output Automata specification to Java code. To mitigate the introduction of errors during
translation, the implementers followed a set of precise rules, similar to [2], that guided the
derivation of Java code from Input/Output Automata notation. The system is undergoing refinement and
tuning; however, an initial evaluation of the performance of the two algorithms has been performed
in a local-area setting.

Figure 21: Preliminary empirical evaluation of the average operation latency (measured as the
number of gossip intervals), as a function of reconfiguration frequency, measured as number of
reconfigurations per one reconfiguration period.

The platform consists of a Beowulf cluster with 13 machines running Linux (Red Hat 7.1). 
The machines have Pentium processors ranging from 90 MHz to 900 MHz and are interconnected via 
a 100 Mbps Ethernet switch. The implementations of the two algorithms share most of their code 
and all low-level routines, so any difference in performance is traceable to the distinct 
configuration management discipline used by each algorithm. 

The machines vary significantly in speed. Given several very slow machines, Musial and Shvarts- 
man do not evaluate absolute performance and instead focus initially on comparing the two algo- 
rithms. 

The preliminary results in Figure 21 show the average latency of read/write operations as the 
frequency of reconfigurations grows from about two to twenty reconfigurations per gossip 
period. To handle such frequent reconfigurations, a large gossip interval (8 seconds) is used. 
This interval is much larger than the round-trip message delay, thus reducing the effects of 
network congestion encountered when reconfiguring very frequently. The results show that the overall 
latency of read/write operations for the new algorithm progressively improves as the frequency 
of reconfiguration increases. As expected, the decrease in latency becomes substantial for bursty 
reconfigurations (at 20 reconfigurations per gossip interval). For less frequent reconfigurations the 
latency is similar, at about 4 gossip intervals depending on the settings (not shown). This is ex- 
pected and consistent with our analysis, since the two algorithms are essentially identical when 
cmaps contain one or two configurations. Figure 22 shows the average number of configurations 
in cmaps as a function of reconfiguration frequency. This further explains the difference in perfor- 
mance, since the average number of configurations in cmaps is lower in the new algorithm as the 
frequency of reconfigurations increases. 

Figure 22: Preliminary empirical evaluation of the average number of configurations in cmaps, as 
a function of reconfiguration frequency, measured as number of reconfigurations per one 
reconfiguration period. 

Finally, notice that the modest number of machines used in this study favored the original 
algorithm. This is because the machines are often members of multiple configurations, thus the number 
of messages needed to reach fixed-points by the read/write operations of the original algorithm is 
much lower than is expected when each processor is a member of a few configurations. 
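
The overlap effect described above can be illustrated with a small sketch. It is not taken from the 
implementation in [16]: the function name and the example memberships are hypothetical, and it makes 
the simplifying assumption that one phase of an operation sends one message to each distinct member 
of every configuration in the cmap. 

```python
from typing import Iterable, Set


def distinct_members(cmap: Iterable[Set[str]]) -> int:
    """Count the distinct nodes appearing in any configuration of a cmap.

    Simplifying assumption (not from the paper): one phase of an operation
    sends one message to each distinct member, so this count approximates
    the number of messages sent per phase.
    """
    members: Set[str] = set()
    for config in cmap:
        members |= config
    return len(members)


# Hypothetical small cluster: 13 machines, 4 configurations of 5 members each,
# all drawn from the same machines, so the memberships overlap heavily.
machines = [f"m{i}" for i in range(13)]
overlapping = [set(machines[i:i + 5]) for i in range(0, 8, 2)]

# Hypothetical larger system: 4 configurations of 5 members with no overlap.
disjoint = [{f"n{c}_{i}" for i in range(5)} for c in range(4)]

print(distinct_members(overlapping))  # 11
print(distinct_members(disjoint))     # 20
```

Under this assumption, the same four configurations cost roughly half as many messages per phase on 
the small cluster as they would if their memberships were disjoint. 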

Also, notice that this evaluation does not examine the effects of message loss and lack of network 
connectivity. We hypothesize that, as in the case of frequent bursty reconfiguration, when there 
are intervals of time in which the network is disconnected, the new algorithm should recover more 
rapidly. This testing has not yet been performed. 

Full performance evaluation is currently in progress. Shvartsman and Musial are investigating 
how the performance depends on the number of machines and various timing parameters. 

9 Conclusion and Open Problems 

In this paper we have presented a new algorithm, improving on the original Rambo algorithm 
by Lynch and Shvartsman [12, 13]. While the original Rambo algorithm is analyzed primarily in 
the context of good network behavior, we are able to show that our new algorithm functions well 
even when the network experiences transient periods of bad behavior, including message loss, clock 
skews, and arbitrary asynchrony, and when reconfiguration is bursty and uneven. 

The key to this improvement is a new rapid configuration-upgrade mechanism, which allows 
the system to stabilize rapidly after a period of bad network behavior. In the previous Rambo 
algorithm, it might take arbitrarily long to recover from a period of bad behavior. In this new 
algorithm, however, within a constant time, the system returns to a steady-state condition. This 
allows the algorithm to function more reliably in a long-running, dynamic system: when a system 
is expected to function for months and years without failure, it is necessary to rapidly recover from 
the inevitable transient network failures. 

This improvement also makes practical the design of algorithms to choose new configurations. 
In the earlier version of Rambo, it is unclear what properties a reconfiguration algorithm must 
support in order for it to be useful. This paper shows that a reconfiguration automaton must 
provide exactly (a', 22d)-configuration-viability. 

To design such a reconfiguration algorithm, then, is one of the major open problems posed by 
this paper. In particular, it seems important to show that if the rate of failure is bounded, then the 
algorithm continues to make progress. This is similar to the ideas introduced by Karger and Liben- 
Nowell in [10], in which they assume that the system has a bounded half-life: the time in which 
either half the processes fail or the number of active processes doubles. Under this assumption, 
they show that their algorithm operates correctly. 
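
The half-life notion can be made concrete with a minimal sketch. It follows the informal description 
in this paragraph rather than the formal definition in [10]; the event-log format and all names are 
hypothetical, and the set of live nodes is assumed to be non-empty. 

```python
from typing import Iterable, Optional, Set, Tuple

Event = Tuple[float, str, str]  # (time, "join" or "fail", node id)


def half_life_from(start: float, live: Set[str],
                   events: Iterable[Event]) -> Optional[float]:
    """Elapsed time after `start` until either half of the nodes in `live`
    have failed, or as many new nodes have joined as were live at `start`.

    Returns None if neither condition is met by the given events.
    """
    n = len(live)
    failed: Set[str] = set()
    joined: Set[str] = set()
    for time, kind, node in sorted(events):
        if time < start:
            continue  # ignore events before the reference time
        if kind == "fail" and node in live:
            failed.add(node)
        elif kind == "join" and node not in live:
            joined.add(node)
        if 2 * len(failed) >= n or len(joined) >= n:
            return time - start
    return None


live = {"a", "b", "c", "d"}
events = [
    (1.0, "join", "e"),
    (2.0, "fail", "a"),
    (3.0, "join", "f"),
    (5.0, "fail", "c"),
    (6.0, "join", "g"),
]
print(half_life_from(0.0, live, events))  # 5.0
```

In this example the halving condition is reached first: two of the four initially live nodes have 
failed by time 5.0, while only two new nodes have joined. 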

By similarly assuming a bounded rate of failures, it should be possible in certain cases to design 
a reconfiguration algorithm that guarantees liveness by initiating reconfiguration with some min- 
imum frequency. By choosing appropriate quorums and appropriate numbers of reconfigurations, 
(a', 22d)-configuration-viability should be possible. 

Other open problems include improving the join protocol, and designing a leave protocol to 
allow good detection of nodes that have exited the system. Currently, the join protocol is quite 
simple and it would seem beneficial to require more communication before allowing a node to 
initiate operations. In the algorithm as stated, nodes that fail or leave are simply ignored. 
By introducing a formal protocol to leave the system, and a method for detecting failed nodes, it 
might be possible to improve the long-run performance of the system. 

Another open problem is to determine how to recover when viability fails (and data is inevitably 
lost). More generally, is a self-stabilizing version of Rambo feasible? It would also be interesting 
to determine whether a version of Rambo could be adapted to tolerate Byzantine faults. 

Rambo may also allow the construction of other data types, such as weakly consistent memory 
and sets. It may also be possible to optimize Rambo to return read values more rapidly, in one 
phase, in certain cases. An important question would be to determine the most powerful data 
object that can be implemented using the Rambo technique; one suspects that it is impossible to 
implement consensus in this manner. 

Finally, it would be interesting to examine how the Rambo algorithm could be adapted to 
specific platforms. The algorithm is presented in a fairly abstract fashion. In real implementations, 
it would be optimized depending on the target platform. In particular, we suspect that Rambo 
should work well in sensor networks, mobile networks, and peer-to-peer networks. 

In conclusion, this paper has presented a new algorithm for atomic memory in a highly dynamic 
environment, proved that it is always correct, and presented a set of conditions that guarantee liveness. 
This provides significant improvements over existing algorithms, rapidly recovering from transient 
network problems and bursty reconfiguration. 